Accountable and transparent


The US government has changed how biomedical scientists disclose their financial interests. The 
revised rules are welcome, but Internet access to the identified conflicts should be a requirement. 


Strict rules for how US biomedical scientists report financial interests
came into force last month. The changes,
which affect scientists who receive grants from the government, 
are welcome — although in one respect they do not go far enough. 

About 38,000 researchers, most of them recipients of grants from 
the US National Institutes of Health (NIH), the world’s largest
medical-research funder, will need to comply with the beefed-up rules.
The changes update regulations put in place in 1995 to ensure that 
investigator bias doesn't sway the design, conduct or reporting of 
research. 

There are several important changes. First, investigators must now 
disclose to their institutions every “significant financial interest” 
belonging to themselves or their immediate family that is related to 
any of their institutional responsibilities — from teaching and seeing 
patients to lab research and service on ethics committees. This require- 
ment appropriately casts a broader net than the previous rules, which 
generally asked for disclosure on only a project-specific basis. 

The change ends ambiguity that, for instance, might have allowed 
a researcher to conclude that paid service on the board of a major 
pharmaceutical company drew only on clinical expertise, and there- 
fore was not relevant to a government-funded research project that 
used one of the company’s experimental compounds. Under the 
updated rules, there will be no question that such income must be 
disclosed, and institutions will have a more complete picture of their 
scientists’ potentially relevant financial interests. 

It takes only one example to drive home the significance of this 
change. Between January 2000 and January 2006, high-profile 
psychiatrist Charles Nemeroff, then at Emory University in Atlanta, 
Georgia, received more than US$800,000 in payments from drug- 
maker GlaxoSmithKline for over 250 speeches that he gave to psy- 
chiatrists. He failed to disclose this income to Emory administrators. 
After being discovered, Nemeroff argued that the rules on whether 
such income was reportable were ambiguous. 

The tougher rules, crucially, give institutions prime responsibil- 
ity for determining whether a given financial interest — company- 
paid speaking honoraria, consulting fees, paid authorship, travel 
reimbursements and stock ownership all qualify — is related to a 
government-funded grant, and whether it constitutes a conflict. Under 
the old regime, the scientist was charged with deciding whether a given 
interest was related to the research and thus whether it was reportable. 
That arrangement did not inspire confidence — a problem in an era 
in which public trust in the medical enterprise is at risk and must be 
built, not undermined. 

The updated rules also lower the threshold at which an interest is 
defined as significant, from $10,000 under the old rules to $5,000. In 
a moribund economy with many US taxpayers struggling to make 
ends meet, this is fitting. 

The rules have also been strengthened in other important ways.
For instance, far more detail will now be reported by institutions to
the NIH about each identified conflict, including the approximate 
dollar value of the interest and the measures being taken to manage or 
eliminate the conflict. There is also, importantly, an explicit exception 
to the disclosure requirements for income that scientists earn from 
universities or government agencies for teaching, serving on advisory 
or review panels and giving seminars or lectures. 

The new rules fall down, however, in one significant regard. When
it first published the proposed changes, the NIH described what it
called “an important and significant new requirement to ... underscore
our commitment to fostering transparency, accountability, and public
trust”. That requirement was that institutions would post details of
their investigators’ financial conflicts of interest on a publicly
accessible website that was updated every year. In the final iteration
of the new rules, the website has been made optional, and institutions
faced with requests for information may instead respond in writing,
within five business days. This is an outdated approach to transparency.
It will not advance the public’s faith in timely, comprehensive
and truly accessible disclosure, at a time when the boundary between
academia and industry has become ever more porous, and when the
average citizen's trust in government-funded medical research is ever
more crucial. The NIH should revise the rules again to make the
website mandatory. It is within the agency’s power to insist on this
standard, and it is the right thing to do.


Spinning threads 


Publication of ENCODE data drives 
innovation in data mining. 


There can be few scientists who have not used a brightly coloured
highlighter pen to mark the most interesting parts of a research
paper, report, proposal or (librarians look away) book. It is a
natural reaction when faced with a swamp of information — to build
islands of focus that can be identified and linked, both in print and
in the mind.

This week, Nature introduces a new concept in the publishing and
dissemination of scientific information: one that is a response to the
increasing complexity of modern research, and one that draws heavily
on the contribution of the humble highlighter.

Starting on page 45, we publish a package of material that centres
on the results from the ENCODE consortium, including 6 of the
30 papers the project has produced. The ENCODE — Encyclopedia 
of DNA Elements — consortium set out to describe all the functional 
elements in the human genome. Their headline conclusion: more than 
80% of the human genome’s components have now been assigned at 
least one biochemical function. 

The six papers that Nature publishes (the others appear simultane- 
ously in Genome Research and Genome Biology) may look like conven- 
tional research reports, but in the digital world they begin to take on 
new form — as themed threads. If you are reading this online, then 
click on the link. If you are reading it in print, then have a look at the 
version on Nature’s ENCODE explorer website (www.nature.com/encode)
or, better still, the iPad app.

As part of the publication process, the ENCODE authors asked for 
something extra: to select and package together the sections from each 
paper that will be of particular interest to scientists in various and 
varied fields. Just as a postdoctoral researcher looking at transcription 
factors would use a highlighter to mark up different bits of the papers 
from, say, a colleague looking at DNA methylation, so the ENCODE 
authors thought that researchers across the biological spectrum would 
want to be able to pull together pieces from each of the digital versions 
that were of specific interest to them. Our editors agreed, and the result 
is 13 online threads — biological themes that contain no original mate- 
rial but instead harvest and combine related paragraphs, figures and 
tables from the 30 papers. 

The threads, we hope, will help readers to make sense of the dizzying 
amounts of data produced during the five years of the main ENCODE 
effort. And they should allow scientists to exploit more easily the infor- 
mation in their own studies, and that, after all, was the point of the 
project in the first place. Presented online, the threads are filled with 
links that allow readers to jump easily from paper to paper, to see where 
the information comes from and how the data are interconnected. 


Alongside the thread concept, the ENCODE package introduces 
another technical innovation, at least one new to Nature. Using a ‘virtual
machine’, online readers can access software designed to perform
set computational functions on some of the ENCODE data. 

The idea is to allow readers to recreate the analyses behind the spe- 
cific aspects of the paper and to see how the outcome changes when 
specific parameters are tweaked. Think of it as a bridge that links the 
data, the analysis and the relevant description 
and discussion in the formal papers. 

We are eager to hear what readers and users of the material think of
these approaches. If they are useful, and early feedback suggests that
they will be, then scientists who work on other similarly data-rich and
analysis-heavy projects should take note. Results from projects
that aim to sequence the human microbiome or different forms
of cancer, for example, produce piles of data that could be split along 
many different themes, and so divided into threads. In many cases 
the true hard work — the science — is done. Threads, then, are just a 
different way to package the results. 

Some practical problems remain in applying these ideas more 
widely. The thread concept depends on cooperation between pub- 
lishers, as well as open access to the papers and appropriate copyright 
agreements. And the virtual machine demands well curated data that 
are available to all. 

Why are there 13 ENCODE threads? Good question: there could
have been many more — as many as there are questions raised in 
the minds of scientists by the mass of information that the project 
has placed at their disposal. If your particular interest or angle is not 
already selected and presented as a theme, then apologies — there is 
always the old-fashioned route: print the papers and attack them with 
a highlighter.




Moonlight drive 


The data from the ageing Voyager probes are 
illuminating the edge of the Solar System. 


Someone in the NASA media-relations office knows their music.
A press release from the agency last month stated that the twin
Voyager spacecraft were poised to Break on Through to the
Other Side — referring to the probes’ approach to the edge of the
Solar System, but also to a 1967 hit from the US band The Doors.
NASA pointed out to journalists that the missions were launched
35 years ago and was no doubt hoping for some (more) positive
coverage to mark the anniversary. What’s more, on 13 August,
Voyager 2 became the longest-operating spacecraft, beating the record
of Pioneer 6, which was launched in December 1965 and returned
its final signal some 12,758 days later. (Voyager 2, counterintuitively,
was launched two weeks before Voyager 1, but the latter is now the
farthest from the Sun.)

The spin doctors can be excused this time. Voyager is a truly great
mission, and one that reporters still find hard to resist — some of them
have been happily writing about its discoveries ever since the two craft
launched in 1977. It is the science story that keeps on giving: the deep,
hazy atmosphere of Saturn’s moon Titan; the volcanoes of Jupiter's moon
Io; the large, unusual magnetic field of Uranus; and the geysers of Triton,
the frozen world that orbits Neptune — all discovered and lapped up by
an eager public as the probes skimmed past the outer planets.

Still, their work is not done. Even though the probes are now more
than 15 billion kilometres away from the Sun, their handlers on the
ground remain in near-daily contact, as the spacecraft continue to
send back useful information — now about the farthest reaches of the
Solar System. Last year, NASA even coaxed the ageing and radiation- 
blasted parts of Voyager 1 into performing a series of rolls to have a 
proper look around. It was curious because some of the data being sent 
back from the spacecraft seemed to suggest that the edge of the Solar 
System was nearby. Levels of high-energy cosmic rays, which origi- 
nate far beyond our corner of space, had spiked. And the number of 
lower-energy particles that come from closer to home seemed to dip. 

The results of the latest tests, which are published on page 124, have 
surprised many. If Voyager 1 truly is near the point where the helio- 
sphere — the bubble of charged particles from the Sun — fades to 
interstellar grey, then it should have found solar particles that have 
been buffeted by the winds of deep space, generated by supernovae 
that exploded long ago elsewhere in the Galaxy. In fact, the particles 
it found had effectively been becalmed. 

The implications of the discovery for our understanding of the 
structure of the Solar System, and how it changes as it whizzes through 
space, are profound. As a News story on page 20 explains, the find 
could mean that astronomers will have to rethink their models of the 
heliopause, the boundary at which the outward pressure of the helio- 
sphere is balanced by the inward push of outer space. Or it could mean 
that Voyager 1 is still some distance from the heliopause. 

That would no doubt disappoint the NASA press office, which is eager 
to announce that at least one probe has entered a new realm of discovery 
— and before the batteries of the spacecraft run out, in a decade or so. 
But it should not lose heart. Like the Voyager probes, The Doors are still 
going, albeit not as strongly and with their best work probably behind 
them. If the heliopause is farther away than we 
thought, and the reach of the solar wind longer 
than we realized, then the Voyager twins still have 
many years remaining as Riders on the Storm, 
and some way to go before they reach The End.
We must be open about our mistakes

Greater transparency about the scientific process and a closer focus on
correcting defective data are the way forward, says Jim Woodgett.

There is increasing unrest in global science. The number of
retractions is rising, new examples of poor oversight or practice
are being uncovered and anxiety is building among researchers.
Those of us who work in the life sciences are discovering that some
of our basic premises are flawed or inaccurate — cell lines have been
misidentified and drug metabolism in animal models misjudged. Even
high-profile findings have been questioned. Building on solid foundations
was an architectural principle understood by the ancient Greeks
and Egyptians, yet we may be constructing our castles on swampland.
Is it a surprise that clinical translation fails so often?

Although most mistakes are unintentional and sometimes unavoidable,
there are also deliberate efforts to deceive. Scientists (especially
those of us in biomedical research) must do more to detect and be
seen to correct errors as an on-going imperative.

We scientists must recognize that, to the public and politicians, we
are a privileged and elite group. The products of our work are largely
incomprehensible to non-experts — and even to colleagues on the
periphery of the same field. Like an iconoclastic gentlemen's club, our
community has developed rules and etiquette to maintain order. But,
unlike a club, our sponsorship fees are paid by taxpayers and
philanthropic donations.

The scientific community must be diligent in highlighting abuses,
develop greater transparency and accessibility for its work, police
research more effectively and exemplify laudable behaviour. This
includes encouraging more open debate about misconduct and malpractice,
exposing our dirty laundry and welcoming external examination. A good
example of this, the website Retraction Watch
(retractionwatch.wordpress.com), shines light on problems with papers
and, by doing so, educates and celebrates research ethics and good
practice. Peer pressure is a powerful tool — but only if peers are
aware of infractions and bad practice.

We might also better foster and acknowledge aspects of research
that are often overlooked. Efficient reagent exchange and sharing,
for example, protects against cheats and can help to correct more
common, unintentional errors.

The inherent uncertainty of research provides a safe haven for data
omission, manipulation or exaggeration. Because interpretation of
data is an imperfect science, there are few consequences for those
tempted to oversell their findings. On the contrary, such faulty
embellishment can help to determine whether a study is published — and
where. Moreover, because failure to reproduce a published finding can
be due to innocent factors, ...
dead-end data that pollute the literature and waste precious resources.

To counter this, barriers to correction of the public record should be 
low but rigorous. Publication of refutations or modifications should be 
encouraged by journals and funding agencies. One may argue that if a
study is ignored it does no harm, but superfluous publication clutter is 
not benign. Minimally, it adds chaff to the wheat, but it also promotes 
mediocrity by example. More importantly, it provides meticulously 
documented evidence of apparent waste to funders and the public. 

In a culture of publish or perish, the continuing growth in the number
of scientific journals is hardly a surprise. But does this proliferation
of papers reflect better science, or merely dilution? When a third of all
papers are never cited, it is reasonable to question why so many are 
published. If the answer is simply as a form of accepted currency to 
indicate productivity, then our evaluative systems 
must become less reliant on publication quanta. 

Before we complain legitimately about grant 
success rates and funding pressures, we must 
ensure that our own house is in order. The act 
of publishing takes significant effort, yet we still 
publish low-impact studies as the required unit 
of research. We must learn to stop publishing 
everything and find other ways to document and 
recognize our studies, such as searchable publica- 
tion of theses, meeting proceedings and posters. 

And take the way most scientists access money 
from the public purse. Despite being the conduit 
to research funds, grant proposals undergo lim- 
ited vetting of their content. Unlike manuscripts 
that pass peer review, these documents are treated 
as confidential, so their writers are difficult to hold 
to account. There are legitimate concerns about 
intellectual property and fear of being scooped by competitors, but why 
not make such documents public after a period of time? Indeed, some 
scientists are already publishing their grant applications on the Internet, 
ostensibly to help educate new researchers. But this also allows valida- 
tion and cross-checking and sets a new bar for transparency. 

Other searchable Internet technologies, such as social media, blogs, 
slide-sharing sites and even video-sharing sites such as YouTube, are 
helping to lift the veil of secrecy over science. This increased transpar- 
ency, associated with wider access and discussion, is a powerful weapon 
to reduce scientific misinformation of all sorts — and one that all hon- 
est and careful scientists should embrace. Transgressions and errors 
will be more quickly detected and more widely communicated when 
more of what we do is exposed to scrutiny. As security professionals 
know, the surveillance camera does not need to be turned on to deter.


Jim Woodgett studies signalling pathways at the Samuel Lunenfeld 
Research Institute in Toronto, Canada. 
e-mail: woodgett@lunenfeld.ca 
When alkanes
turn tail 


Alkanes are molecules that 
contain only carbon and 
hydrogen atoms, connected 
by single bonds. Short-chain 
alkanes such as butane and 
octane — which contain linear 
chains of four and eight carbon 
atoms, respectively — stretch 
out in extended zig-zags. 
However, longer hydrocarbon 
chains tend to fold themselves 
into hairpin structures. 
Ricardo Mata, Martin 
Suhm and their colleagues at 
the University of Göttingen,
Germany, determined the 
point at which this transition 
becomes energetically 
favourable. The researchers 
performed spectroscopy 
on supersonic jets of alkane 
molecules at temperatures of 
100-150 kelvin — and found 
that the folded structure 
becomes more stable than the 
extended conformation when 
an alkane chain is around 
18-19 carbon atoms long. 
The result broadly agrees 
with the authors’ quantum 
calculations, and can be used 
to train computer models of 
molecular mechanics. 
Angew. Chem. Int. Edn
http://dx.doi.org/10.1002/anie.201202894 (2012)


Excavation of
a digger


Examination of a 57-million-year-old nearly complete fossil skeleton
(selected bones pictured) has advanced a long debate over the place of
the mammal Ernanodon antelios in evolutionary history.

The fossil of the ancient mammal was discovered in rocks in Mongolia.
Peter Kondrashov and Alexandre Agadjanian from the Borissiak
Paleontological Institute of the Russian Academy of Sciences in Moscow
describe E. antelios as having strong forelimbs and large claws, which
it used to scratch and dig for food. Examination of the bones led the
authors to suggest that the mammal is more closely related to pangolins
than it is to armadillos and anteaters.
J. Vertebr. Paleontol. 32, 983-1001 (2012)


MATERIALS


Why barnacles stick around


Barnacles are among the clingiest of creatures, but how they manage to
stick so tenaciously to surfaces is unclear.

When Jaimie-Leigh Jonker of the National University of Ireland, Galway,
and her colleagues examined the barnacle Lepas anatifera, they found
that its adhesion system is radically different from that of other
clingy sea creatures, such as mussels and tubeworms. Large, single-cell
glands in L. anatifera secrete a clumpy substance filled with sticky
proteins, although exactly how the glue works remains mysterious.

Researchers hope that future studies of barnacle glue will yield better
adhesives, particularly for medical applications.
J. Morphol. http://dx.doi.org/10.1002/jmor.20067 (2012)


The mystery of
high seas methane


Marine microbes offer a plausible explanation for the surprising
abundance of methane in oxygenated parts of the ocean.

Scientists have previously theorized that ocean methane might be a
by-product of microorganisms’ use of methylphosphonic acid as a source
of phosphorus. But it was unclear where the acid itself came from.
William Metcalf and Wilfred van der Donk at the University of Illinois
in Urbana and their colleagues show that a microbe called
Nitrosopumilus maritimus carries genes that encode a pathway for
methylphosphonate synthesis. A crucial gene in this pathway is also
found in many other marine microbes, suggesting that these organisms
may be the source of the unexplained ocean methane.
Science 337, 1104-1107 (2012)


Small families in 
rich societies 


The tendency of families in 
wealthier societies to produce 
fewer children is hard to 
explain in evolutionary terms. 
A study of Swedish families
examines the paradox, known
as demographic transition. 

One model proposed to 
explain the phenomenon 
holds that fewer offspring 
receive more resources, 
making them more likely to 
have offspring themselves. The 
model posits that richer people 
might have fewer children, but 
would ultimately have more 
descendants over subsequent
generations. 

Not so, say Anna Goodman 
of the London School of 
Hygiene and Tropical Medicine 
and her team. In their analysis 
of 14,000 Swedish people 
born between 1915 and 1929 
and their descendants, small
family size predicted greater 
socioeconomic success in 
children, grandchildren 
and great-grandchildren, 
particularly among families 
that already had high 
socioeconomic status. 

But small family size did 
not translate into greater 
reproductive success among 
the descendants. 

Proc. R. Soc. B http://dx.doi.org/10.1098/rspb.2012.1415
(2012)


BOTANY 


Plants split cells 
to put down roots 


Plant cells cannot migrate,
so plants control the 
development of multilayered 
tissues such as roots through 
asymmetric cell divisions that 
create layers with different 
identities and functions. 

A team headed by Athanasius Marée of the John Innes Centre in Norwich,
UK, and Ben Scheres at the University of Utrecht in the Netherlands
unravelled the molecular pathway that regulates these cell divisions in
the root tip. Stem cells in the model plant Arabidopsis are triggered
to divide unevenly by a positive feedback loop that takes effect when a
protein called RETINOBLASTOMA ceases to inhibit another, called
SCARECROW. Gradients of a growth hormone and a protein called SHORT
ROOT ensure that this loop is triggered in the correct place. Protein
degradation during the division prevents the process from continuing
indefinitely.
Cell http://dx.doi.org/10.1016/j.cell.2012.07.017 (2012)


Disintegrating 
planet spotted 


NASA’s Kepler spacecraft seems
to have spotted a distant, rocky 
planet that is falling apart. 
Kepler hunts for planets 
beyond the Solar System by 
searching for steady, periodic 
dimming in the light of parent 
stars, which indicates the 
passage of an orbiting body. 
In the case of the star KIC 
12557548, however, the drop in 
starlight varies in strength with 
each passage. Scientists have 
suggested that this variability is 
a sign of an orbiting planet that 
is trailed by a large dust cloud. 
Matteo Brogi of Leiden 
University in the Netherlands 
and his team modelled the 
dust cloud and found that its 
presence could indeed explain 
the Kepler data. The cloud is 
probably the result of the planet 
being bombarded by so much 
stellar radiation that it has 
begun breaking up into dust. 
Astron. Astrophys. http://dx.doi.org/10.1051/0004-6361/201219762
(2012)


Pruning back 
carbon estimates 


Incorporating tree-height 
data into calculations of the 
amount of carbon stored in 
tropical forests reduces the 
estimates by roughly 13%. 

Ted Feldpausch of the 
University of Leeds, UK, and 
his team analysed data from 
327 tropics-wide plots, as 
well as 20 sites where tropical 
trees have been cut down, 
collecting data on factors such 
as the weight and height of 
the trees, and their carbon 
density. The team found that 
information on tree height was 
crucial for making accurate 
biomass estimates, and that the 
relationship between height
and carbon storage varied by
region.

The authors underscore
the importance of including
better data in biomass maps, in
which field measurements are
increasingly being integrated
with remote-sensing data to
improve accuracy.
Biogeosciences 9, 3381-3403
(2012)


Hunter-gatherer workout disproved

HIGHLY READ on www.plosone.org in August

Despite their very different lifestyles, a hunter-gatherer expends
about the same amount of energy each day as the average person in
Europe or the United States.

For 11 days, Herman Pontzer of Hunter College in New
York and his colleagues measured daily energy expenditure
and physical activity levels in 30 adults from a Hadza
hunter-gatherer group in Tanzania. Controlling for factors such as age,
sex, body fat and body mass, the researchers compared their
results to individual and population data from a spectrum of
societies, including Western countries. Hadza individuals walk
longer distances and forage for resources. So, unsurprisingly,
they had higher physical-activity levels than Westerners.
However, on average, both groups used the same amount of
energy on a daily basis, as well as when walking or resting,
suggesting that the rate of energy expenditure is an evolved trait
that is independent of culture.

Obesity trends in Western populations could be unrelated
to a sedentary lifestyle, the researchers suggest.
PLoS ONE 7, e40503 (2012)


Sticking the 
unstickable 


Researchers have succeeded 
in sticking together two 
supremely unsticky polymers 
— Teflon and cross-linked 
poly(dimethylsiloxane), 
the slippery coating used as 
backing paper for stickers. 
The secret to their success 
lies in tetrapodal zinc oxide 
crystals: micrometre-scale 
structures (pictured) shaped 
rather like children’s jacks. 
Strewing these between the 
polymers and heating the 
resulting sandwich to 100°C 
for 40 minutes creates a kind 
of ‘micro/nano Velcro’. The
polymers can be peeled apart
only by applying a force of
about 200 newtons per metre
— more than that required to 
unstick Scotch tape. 

Rainer Adelung and his 
team at the University of 
Kiel, Germany, did not stick 
the unstickable for glory 
alone. Stuck together, these 
surfaces will have applications 
in technologies such as 
membranes for separating 
liquids, and biomedical 
implants. 

Adv. Mater. http://dx.doi.org/10.1002/adma.201201780
(2012)
Genome decoded
The Encyclopedia of DNA 
Elements (ENCODE) 
consortium this week 
publishes the fruits of its 
endeavour to understand how 
human cells use the genomic 
code. Across 30 papers 
published in Nature (see 
page 45), Genome Research 
and Genome Biology, the 
team reveals that more than 
80% of the human genome’s 
components have now
been assigned at least one
biochemical function. See 
nature.com/encode for more. 


Resistance warning 
More than 40% of multidrug-resistant (MDR) tuberculosis
infections are also resistant to some of the common
second-line backup drugs, according to research published on
29 August (T. Dalton et al. Lancet http://doi.org/h8r; 2012).
MDR strains are not routinely screened for resistance to
second-line drugs in the poor countries where the incidence of
tuberculosis is highest. Out of 1,278 people with MDR
tuberculosis, 6.7% could be classified as having extensively
drug-resistant tuberculosis — almost untreatable strains that
are resistant to several common backup drugs. See
go.nature.com/dklimh for more.


Virus discovery
A new type of phlebovirus causing fever, severe fatigue
and nausea has been identified in a paper published on
30 August (L. K. McMullan et al. N. Engl. J. Med. 367,
834-841; 2012). Found in Missouri, it is the first virus
pathogenic to humans to be discovered in the United States
since hantavirus in 1993. Dubbed the Heartland virus,
the phlebovirus is probably spread by the lone star tick
(Amblyomma americanum) and is distantly related to a
tick-borne and potentially lethal phlebovirus discovered
in China last year. The two Missouri men infected with
the virus recovered, however.


One in five invertebrates face extinction

The first comprehensive effort to review the conservation status of
the world’s invertebrates shows that about one-fifth of species are
at risk of extinction, according to a report from the Zoological
Society of London. Such creatures are thought to represent around
99% of the biodiversity on Earth. The report suggests that the
greatest threat is to freshwater invertebrates, followed by
terrestrial and marine invertebrates, such as nudibranch sea
slugs (Hypselodoris kaname, pictured). See
go.nature.com/r2uf2y for more.


BUSINESS

Drug hope dashed
Prospects for a new class of drug to treat schizophrenia
were scotched on 29 August, when pharmaceutical giant
Eli Lilly halted the late-phase clinical trial of its drug
pomaglumetad methionil, also known as mGlu2/3,
which modifies glutamate neurotransmission in the
brain. The company, based in Indianapolis, Indiana, said the
drug seemed to be ineffective. Current schizophrenia drugs
work primarily by reducing levels of the neurotransmitter
dopamine in the brain, but they do not control all
symptoms of the illness.


Late apology
Pharmaceutical company Grünenthal, based in Aachen,
Germany, has apologized for the first time for the effects
of the drug thalidomide. The firm developed the drug,
which was used to treat morning sickness in pregnant
women between 1957 and 1961. Thalidomide was
withdrawn after causing birth defects in thousands of babies.


FUNDING

ArXiv boost
The arXiv preprint server at Cornell University Library in
Ithaca, New York, is to get up to US$350,000 a year for the
next five years from the Simons Foundation, a charity based in
New York that supports basic research. The sum includes an
unconditional annual grant of $50,000, with the remainder
depending on matching funds from arXiv’s other donors, the
library said on 28 August. The foundation was set up in 1994
by mathematician and hedge-fund manager James Simons
and his wife, Marilyn. See go.nature.com/xfapfr for more.


Carbon trade grows 


Australia announced on 
28 August that it is to join 
the European Union (EU) 
Emissions Trading System, 
marking the first time that 
a non-European country 
has linked up with the 
greenhouse-gas-reduction 
strategy. Australian firms will 
be able to cover up to 50% 

of their carbon emissions by 
purchasing carbon permits 
issued to European companies 
from 2015. EU companies 

will be able to buy Australian 
permits from 2018. 


Forest code final 


Brazil's controversial forest- 
protection law reached 

what is likely to be its final 
form on 29 August, after a 
congressional committee 
made further changes to the 
version proposed by President 
Dilma Rousseff in May. The 
text further reduces protection 
for forests abutting rivers, 

for example. See go.nature. 
com/34qwnl for more. 


Embryo ruling 

The European Court of Human 
Rights ruled on 28 August 

in favour of an Italian couple 
who want to be able to screen 
their in vitro fertilized embryos 
for a disease-causing gene 
before implantation. A 2004 
Italian law currently bans 
preimplantation genetic 
diagnosis. The couple both 
carry mutations that cause 
cystic fibrosis, and their first 
daughter has the disease. 


Wolves delisted 

The US Fish and Wildlife 
Service has removed grey 
wolves from the endangered- 
species list for Wyoming, the 
last state in which hunting 
of the animals was regulated 
by the federal government 
(see go.nature.com/4zmmic). 
Wolves will be managed by 
the state from 30 September, 
which will probably mean 
that wolves can be shot on 
sight outside protected areas 
such as Yellowstone National 
Park. Environmental groups 
have promised legal action to 
reverse the move. 


US President Barack Obama 
has signed new rules requiring 
car and truck manufacturers 
to almost double average 
fuel efficiency by 2025. First 
announced a year ago, the 
standards approved on 28 August 
would see US cars reach the 
current efficiency of Japanese 
cars by the mid-2020s (see chart). 
They would also bring emissions 
down to around 107 grams of 
carbon dioxide per kilometre 
travelled (behind the target 
of 95 g CO₂ per km set by the 
European Commission for 2020). 


Misconduct verdict 


Shane Mayack, a former 
postdoctoral researcher at 

the Joslin Diabetes Center, 

an affiliate of Harvard 
Medical School in Boston, 
Massachusetts, duplicated 
figures in two stem-cell papers 
and poached figures from 
other sources, an official 
investigation by the US 

Office of Research Integrity 
has concluded. The papers 

(S. R. Mayack et al. Nature 
463, 495-500; 2010, and 

S. R. Mayack and A. J. Wagers 
Blood 112, 519-531; 2008), 
had already been retracted 

by co-author Amy Wagers, a 
stem-cell biologist at Joslin and 
Mayack’s mentor. See go.nature. 
com/jzdtny for more. 


Biosecurity leader 
Samuel Stanley, president 

of Stony Brook University 

in New York, will serve as 
chair of the US National 
Science Advisory Board for 


Biosecurity. The board has 
been enmeshed in controversy 
for recommending in 
December 2011 that two 
research papers on highly 
pathogenic avian influenza 
H5N1 be redacted for safety 
and security reasons, before 
finally voting in favour of 

full publication in March 

this year. Stanley (pictured) 
replaces acting chair Paul 
Keim, a microbiologist at 
Northern Arizona University 
in Flagstaff. All current board 
members are to be replaced. 


EVENTS 


Student sit-in 


Students from Nile University 
in Giza, Egypt, last week forced 
their way into the Zewail City 
of Science and Technology on 
the outskirts of Cairo. They 
were protesting about the 
university no longer having 
access to buildings on the 
Cairo site. The university 
had built on land given to it 
by former-president Hosni 
Mubarak’s government, but 
that gift was rescinded after 
the January 2011 revolution 
and the land given to the 
Zewail City. Nobel laureate 
Ahmed Zewail, a chemist 
at the California Institute of 
Technology in Pasadena, is 
leading the negotiations with 
Nile University to try to settle 
the dispute. See go.nature.com/ 
juxrba for more. 


Virus puzzles 

An outbreak of hantavirus 
originating in California's 
Yosemite National Park has 
so far affected at least six 
people, two of whom have 
died, the National Park 
Service (NPS) announced 
on 31 August. The NPS has 
tried to contact around 1,700 
people who stayed in cabins 
at one of the park villages 
between mid-June and mid- 
August. The virus is spread 
by rodent droppings, and 
this outbreak has puzzled 
medical researchers as the 
rare previous cases originated 
from a single cabin on each 
occasion. But this time, the 
infected visitors had stayed in 
different cabins. See go.nature. 
com/v86lxo for more. 


CRACKING DOWN ON GAS GUZZLERS 

Finalized US vehicle standards would almost double fuel 
efficiency by 2025, but would still lag behind other nations. 


6-8 SEPTEMBER 
The progress of projects 
to chart the epigenome 
will be reviewed at 
the International 
Human Epigenome 
Consortium’s second 
meeting in Seoul, South 
Korea. 
go.nature.com/zncest 


10 SEPTEMBER 
The Balzan prizes are 
announced in Milan, 
Italy. This year sees two 
awards set aside for the 
sciences: for epigenetics 
and solid-Earth 
sciences (each worth 
US$787,000). 
go.nature.com/wfww6q 


Amyloid plaques accumulate in the brains of Alzheimer’s patients (left), but not in unaffected brains (right). 


Alzheimer’s drugs 
take a new tack 


Hopes pinned on pre-emptive clinical trials after latest setbacks. 


BY EWEN CALLAWAY 


After a summer marred by disappointing 
clinical-trial results in patients with 
Alzheimer’s disease, drug developers 
are regrouping to plot a fresh course in the 
battle against the devastating disorder. 

The bad news began in July and August, 
when Johnson & Johnson and Pfizer learned 
that their biological drug bapineuzumab had 
failed to show any benefit in two large trials. 
Then, on 24 August, Eli Lilly said that its drug 
solanezumab had not hit its goal of significantly 
slowing the memory decline and dementia 


that characterize Alzheimer’s disease. 

Both of the failed drugs targeted amyloid-β, 
a protein that forms plaques in the brains of 
patients with the disease and that has long 
been the prime suspect for causing it. But rather 
than abandoning the amyloid hypothesis, 
scientists are pinning their hopes on innovative 
clinical-trial designs and new diagnostics that 
would allow them to test 
compounds earlier in the 
disease and gauge their 
efficacy more quickly. 

Many worry, however, 
that investors spooked 
by the hundreds of millions of dollars spent 
on failed trials will be reluctant to support a 
continuing search for effective treatments for 
Alzheimer’s and other dementias, which affect 
an estimated 36 million people worldwide. 
“Money is tight,” says Husseini Manji, global 
therapeutic area head in neuroscience at Johnson 
& Johnson in New Brunswick, New Jersey. 
But “we're still very committed. We think this 
is a major societal problem that needs tackling.” 

Amyloid-β plaques are thought to cause 
Alzheimer’s disease by killing neurons and 
severing their connections to their neighbours. 
But the evidence is circumstantial. Autopsies of 
patients show that larger numbers of plaques 
occur in more severe cases of the disease. 
Also, mutations in the gene responsible for 
amyloid-β seem to have either a risk-enhancing 
or a protective effect. Yet despite all the money 
invested in amyloid-targeting drugs, “we need 
to confirm or refute the amyloid hypothesis”, 
says Paul Aisen, a neuroscientist at the University 
of California, San Diego. 

The first results for solanezumab, released 
by Eli Lilly, which is headquartered in Indian- 
apolis, Indiana, seem to support the hypoth- 
esis. The drug is meant to recognize and block 
amyloid-β before it forms plaques. In patients 
with mild and moderate forms of disease, 
however, solanezumab failed to meet its main 
goals of slowing the decline in memory and 
other cognitive measures, or in the ability to 
perform tasks such as eating and maintaining 
personal care. But other analyses suggest that 
the drug slowed cognitive decline in patients 
with milder forms of Alzheimer’s. No data 
have been released on the magnitude of these 
improvements, though, so it is unclear whether 
they are enough to make a difference to 
patients’ lives. 

“From a purely scientific standpoint, we're 
pleased at the results,” says Eric Siemers, medi- 
cal director of Lilly’s Alzheimer’s team. “These 
are the first clinical-trial data that would also 
support the amyloid hypothesis.” Investors and 
scientists will get a clearer picture this autumn, 
when further data from this summer's trials of 
more than 2,000 patients will be presented at 
conferences. 

The bapineuzumab trials seem to have been 
more of an unqualified failure. This antibody 
drug targets the amyloid-β plaques, in hopes of 
awakening the immune system to clear them 
from the brain. But two trials in approximately 
2,400 patients failed to show any benefit 
compared with a placebo, although this may 
have been because the drug was administered 
in lower doses than solanezumab, owing to 
its higher toxicity. Johnson & Johnson and its 
partner Pfizer, headquartered in New York city, 
say that they will vastly scale back development 
of bapineuzumab. 

Increasingly, researchers think that the 
problem lies not so much with the strategy 
of targeting amyloid-β as with the timing of 
treatment. “The major conundrum in the field 
is: ‘are we just treating people too late?’,” says 
Ronald Petersen, director of the Alzheimer’s 
Disease Research Center at the Mayo Clinic in 
Rochester, Minnesota. Like the fatty plaques 
in coronary arteries, amyloid-β plaques accrue 
over a lifetime, says Petersen. And so, just as 
cholesterol-lowering statins are prescribed 
for patients in middle age to stave off heart 
disease in later life, amyloid-blocking drugs 
given in middle age may prevent Alzheimer’s, 
Petersen says. 

But no one knows when amyloid-blocking 
drugs would need to be taken to prevent the 
disease, and researchers might have to track 
tens of thousands of people for decades to 
determine whether a preventive drug worked. 
“You can’t take every 30-year-old off the street 
and try a prevention study,” says Manji. 

Nonetheless, three studies are set to begin by 
next year that will test whether anti-amyloid 
drugs can forestall early symptoms of Alzheimer’s 
and arrest cognitive decline in patients 
who, on the basis of genetic predisposition or 
amyloid levels, have been identified as being 
at increased risk of developing the disease (see 
‘Tackling Alzheimer’s early’). 


TACKLING ALZHEIMER’S EARLY 

Three studies aim to assess the effects of trial drugs on asymptomatic people. 

Trial name: Alzheimer’s Prevention Initiative 
Aim: To test crenezumab in people who have mutations in the presenilin 1 
gene and other genes that cause Alzheimer’s in middle age. 
Length: 5 years 
Size: 300 people 
Cost: $100 million 

Trial name: Dominantly Inherited Alzheimer Network 
Aim: To test three drugs on asymptomatic people with Alzheimer’s-linked 
mutations in genes for presenilins 1 and 2, and amyloid precursor protein. 
Length: 5 years 
Size: 160 people 
Cost: $60 million for 2 years 

Trial name: Anti-amyloid treatment in asymptomatic Alzheimer’s disease 
Aim: To test a drug in asymptomatic people who have high levels of amyloid-β, 
and some who have a gene variant that increases their risk of Alzheimer’s. 
Length: 3 years 
Size: 1,000 people 
Cost: $110 million 


The Alzheimer’s Prevention Initiative 
will test crenezumab, a drug developed by 
Genentech, based in South San Francisco, 
California, in a large Colombian family that 
has a rare mutation predisposing members to 
develop Alzheimer’s in middle age. The 
US$100-million trial will focus on asymptomatic 
family members for up to five years 
to see if the drug can stave off their inevitable 
cognitive decline. The trial will also seek 
to identify biomarkers, such as amyloid levels 
from brain scans and in cerebrospinal 
fluid, that could be used to assess whether 
crenezumab and other drugs are effective. 
“We need to launch a new era in 
Alzheimer’s-prevention research to set the stage to 
rapidly evaluate treatments,” says Eric Reiman, 
executive director of Banner Alzheimer’s 
Institute in Phoenix, Arizona, who is co-leading 
the Colombia trial. With such markers 
identified, drug companies could quickly get 
a sense of whether or not a drug is preventing 
Alzheimer’s, saving precious money and 
time, he says. 

Drug agencies, including the US Food and 
Drug Administration and the European Med- 
icines Agency, are keeping a close watch on 
those efforts. In theory, approval for preventive 
drugs could be assessed on the basis of clini- 
cal trials measuring changes in biomarkers, 
or surrogates, instead of traditional measures 
of cognitive improvement. However, regula- 
tory agencies are likely to set a very high bar 
for what constitutes a proven surrogate, says 
Siemers. 

Reiman’s study is already bankrolled. But 
the two other imminent trials — one led by the 
Alzheimer’s Disease Cooperative Study, a US 
government-funded programme, and the other 
by researchers at Washington University School 
of Medicine in St Louis, Missouri — are still 
looking for money. Many Alzheimer’s experts 
hope that this summer’s bleak news will not 
scare off investors. 

“We've had this concern for quite some time,” 
says Reiman, “that if these trials were negative 
we would see some major stakeholders and 
investors abandon amyloid-modifying treat- 
ments. We think that would be throwing the 
baby out with the bath water, and abandoning 
Alzheimer’s disease.” 


CONSERVATION 


India’s forest area in doubt 


Reliance on satellite data blamed for over-optimistic estimates of forest cover. 


BY NATASHA GILBERT 


To judge from India’s official surveys, 

the protection of its forests is a success. 

Somehow, this resource-hungry country 
of 1.2 billion people is managing to preserve its 
rich forests almost intact in the face of growing 
demands for timber and agricultural land. 

But a senior official responsible for assess- 
ing the health of the nation’s forests says that 
recent surveys have overestimated the extent of 
the remaining forests. Ranjit Gill of the Forest 


Survey of India (FSI) claims that illegal felling 
of valuable teak and sal trees has devastated 
supposedly protected forests in the northeast 
of the country. He and other experts also say 
that an over-reliance on inadequate imaging 
by an Indian satellite system is making such 
destruction easy to overlook. 

In February, the FSI, part of the govern- 
ment’s Ministry of Environment and Forests, 
released the India State of Forest Report 2011. 
This biennial survey used images from India’s 
remote-sensing satellite system and estimated 
that forest covered 692,027 square kilometres 
of the country (roughly 23% of India’s land 
area), a decline of just 367 km² on the tally 
reported in 2009, and a much smaller loss 
than in Brazil, for example, where more than 
13,000 km² of forest was cleared over the same 
period. But Gill, a joint director of the FSI, is 
openly critical of the FSI’s assessment. 

“We have to accept the grave reality that the 
current figure of forest cover in India is way 
over the top and based on facile assumptions,” 
Gill argues. To bring these allegations to light, 


he has mounted a legal case for consideration 
by India’s Central Empowered Committee 
(CEC), a panel of experts appointed by the 
nation’s Supreme Court to rule on issues con- 
cerning forests and wildlife. 

Gill alleges that the government of Meghalaya 
state in northeast India has failed to act suffi- 
ciently on evidence that illegal felling and coal 
mining is ravaging the region’s protected forests. 
He says that he has seen the deforested areas 
at first-hand, and reported them to the state 
government (see ‘On the stump’). He is also 
concerned that the 2011 forest report records 
large areas in Meghalaya as open or dense for- 
est, when he believes that much of the land had 
been cleared and then allowed to regrow sap- 
lings or bamboo. 

On a field survey last year, Gill and three 
FSI colleagues saw that parts of the Dibru 
Hills protected forest in Meghalaya had been 
illegally felled. He confirmed his field obser- 
vations with 2006 data from the LANDSAT 
Earth-observing satellites operated by NASA 
and the US Geological Survey. The satellite 
data showed that roughly 150,000 trees in the 
area had been cut down in the preceding years, 
across an area of about 10 km². 

Gill also points to an investigation in 2006 
by Meghalaya state’s forest and environment 
department. The report, which he obtained 
through a freedom-of-information request and 
showed to Nature, found illegal saw mills oper- 
ating in the area, as well as freshly felled logs. 
The region has “come under tremendous pres- 
sure and suffered serious depletion, which has 
reached alarming proportions”, that report says. 

According to documents submitted to the 
CEC, the Meghalaya state government claims 
that only 670 trees were felled in the Dibru 
Hills forest from 2004 to 2007. In Gill’s view, 
“the records and reports of the government of 
Meghalaya are not a true picture of the posi- 
tions on the ground”. P. B. O. Warjri, chief sec- 
retary of the government of Meghalaya, told 
Nature that Gill’s claims are “not true”. 

But another state government report 
obtained by Gill documents similar illegal 
deforestation in the nearby Rongrenggre pro- 
tected forest, where 60-70% of the tree cover 
has been lost. The report also found evidence 
that local forest rangers were involved in the 
illegal timber trade, and that illegal coal min- 
ing in the area was taking place in “full knowl- 
edge” of the rangers. Gill is concerned that 
similar lapses are happening, and not being 
reported, in other parts of the country. 

Other tropical-forest researchers share Gill’s 
fears about India’s forests. “The ongoing loss 
and attrition of native forest in India is quite 
widespread, although it isn't being captured by 
the government's satellite data on forest cover,” 
says William Laurance, a conservation biologist 
based in Cairns, Queensland, Australia. “Much of this 
forest disruption is illegal, and encroachment 
into protected areas and reserves is not uncommon, 
in my experience.” 


NATURE.COM 
Nature India: www.nature.com/nindia 


ON THE STUMP 

Some of the protected forests in Meghalaya state have been 
hit by illegal logging, according to an Indian forest official. 

Forest officer Ranjit Gill says that he has evidence of widespread deforestation in Meghalaya (above). 


Anil Kumar Wahal, the director of the 
FSI, denies that forest cover has been over- 
estimated. The FSI team that conducted the 
field visit in May 2011, of which Gill was part, 
“reported a few sporadic patches of felling, and 
old stumps in the field, but nothing as glaring 
as felling of vast swathes of forest”, he says. 
But Wahal admits that the “selective” cutting 
of trees “would not register in the satellite 
imagery due to the technological limitation 
of the medium-resolution sensor used for the 
purpose of forest-cover mapping”. 

Gill notes that the instrument, which flies 
on an Indian remote-sensing satellite, pro- 
duces images with a resolution of 23.5 metres 
per pixel, too coarse to unequivocally identify 
small-scale deforestation. Instead, he says, the 
forest survey should use a newer instrument, 
already operating on an Indian satellite, that 
provides a resolution of 5.8 metres per pixel. 

The FSI uses the lower-resolution 


instrument for its national survey because it 
offers continuous coverage of very large areas, 
explains Wahal. “Gap-free data are really 
essential,” he says. “Using high-resolution data 
would also entail much more manpower and 
time, so a balance has to be struck.” The FSI is, 
however, using the higher-resolution instru- 
ment for some small-scale surveys, he adds. 

Gill argues that the FSI still needs to conduct 
more on-the-ground surveys to corroborate its 
satellite estimates of forest cover. Without this 
reality check, it can be difficult to tell the differ- 
ence between native forests and, for example, 
bamboo. He is calling on the CEC to order a 
visit to the forests to investigate the extent of 
the destruction. A verdict is expected from the 
CEC by the end of the year. 

Last year, India’s government grabbed 
headlines with a US$10-billion, decade-long 
plan — the National Mission for a Green 
India — to create or improve 10 million hec- 
tares of forest. But if Gill is right, it faces a 
more urgent task: to chart and protect the 
forests that remain. 
Trade rules that would raise the cost of HIV medicines come under fire at a July rally in Washington DC. 


PUBLIC HEALTH 


Trade deal to curb 
generic-drug use 


Tighter patent rules could raise drug costs in poor countries. 


BY AMY MAXMEN 


“Wanted,” the notice reads, in an 
American old-west style font, 
“Negotiating text of the Trans-Pacific 
Partnership Agreement.” The online 
advert invites visitors to contribute to a reward 
payable to the WikiLeaks website should it 
manage to expose the trade agreement. As 
Nature went to press, the reward stood at 
US$24,490. 

The tactic, employed by the activist group 
Just Foreign Policy in Washington DC, may 
be extreme, but it reflects a broader unease 
over a negotiation process that the advert says 
“could affect the health and welfare of billions 
of people”. At issue are industry-friendly rules 
governing drug patents that could be written 
into the final text of the Trans-Pacific Partner- 
ship Agreement (TPP). The provisions could 
boost drug development and profits for the 
pharmaceutical industry, but also curb the 
use of cheaper generic medicines in low- and 
middle-income nations. 

“In many parts of the world, access to 
generic drugs means the difference between 
life and death,” says US congressman Henry 
Waxman (Democrat, California). He is one 
of several US politicians voicing concern over 
the closed-door TPP negotiations and the 
influence that the pharmaceutical industry is 
thought to be exerting on the process through 
US trade representatives. With the latest 
round of talks set to begin on 6 September in 
Leesburg, Virginia, public-health advocates are 
expressing fears that the outcome will reduce 
access to medicines. 

Besides the United States, ten Pacific coun- 
tries representing 34% of US trade have so far 
agreed to join the TPP — Australia, New Zea- 
land, Singapore, Malaysia, Brunei, Vietnam, 
Peru, Chile, Canada and Mexico. The agree- 
ment, which could come into effect as early as 
next year, spans several trade areas, meaning 
that some countries may be tempted to forgo 
access to generic drugs in exchange for better 
access to US markets in other industries. 

According to previously leaked documents, 
the TPP looks likely to strengthen patent pro- 
tection for drugs more than any trade agree- 
ment so far. Whereas the current World Trade 
Organization (WTO) agreement sets a mini- 
mum 20-year period for patents around the 
world, the TPP would follow US practice in 
extending patents beyond 20 years when the 
drug-approval process has delayed a drug’s 
market entrance. Partner countries would also 
be pressed to award new patents for off-patent 
drugs that have been formulated in a new way 
or approved for a new set of patients. 

This practice restricts access to medicines 
in poor countries because it extends pat- 
ent monopolies. For example, according to 
Médecins Sans Frontières (also known as 
Doctors Without Borders) in Geneva, Switzer- 
land, countries that have rejected patents on 
new formulations of the off-patent HIV drug 
Abacavir now sell generic versions for as little 
as $139 per person per year, whereas in Malay- 
sia paediatric Abacavir costs $1,200 per child 
per year, because the country granted the new 
formulation a patent. But a spokesperson from 
the Office of the US Trade Representative says 
that patenting new formulations of old drugs 
provides an incentive for drug companies to 
develop adaptations “that are valued in devel- 
oping countries, like heat-stabilized medicines 
for places without refrigeration”. 

Industry stakeholders say that drug com- 
panies need greater protection as the indus- 
try enters an unprecedented period of patent 
expirations (see Nature 480, 16-17; 2011) and 
faces stiff competition from generics produced 
in India and China. They argue that sales of 
generics need to be restricted if companies are 
to recoup the millions they invest in develop- 
ing new drugs. “If TPP countries wish to be 
those in which innovation flourishes, they 
should have strong intellectual property,” says 
Stephen Ezell, senior analyst at the Informa- 
tion Technology and Innovation Foundation, 
a non-profit think tank in Washington DC that 
supports patent extensions. 

The negotiators are considering special 
protections for biologic drugs — those based 
on large biological molecules. One possibil- 
ity under discussion would grant companies 
a 12-year period of exclusivity on clinical-trial 
data related to the biologics they develop. Mak- 
ers of equivalents of small-molecule drugs 
rely on such data when they seek government 
approval for their products. Without access 
to the data, the generics company would have 
to repeat the costly clinical trials or delay the 
time-consuming approval process for its prod- 
uct by 12 years. Charlene Barshefsky, a former 
US trade representative who now advises com- 
panies on trade law, explains that the biologics 
market, which was worth US$149 billion glob- 
ally in 2010, needs extra protection because bio- 
logics cost more to develop than small-molecule 
drugs. “I am not saying that a foreign innovator 
cannot develop their own biologic drug, they 
just need to do their own homework,” she says. 

More generally, stronger patent provisions 
would harm small, domestic manufacturers of 
generic drugs in Malaysia and Vietnam, says 
Shawn Brown, formerly vice-president for 
international affairs and state government at 
the Generic Pharmaceutical Association based 
in Washington DC. They would also cut sales 
for larger generics manufacturers in the United 
States, Australia and Canada that supply low-cost 
drugs to the world. 

Some countries whose governments purchase 
drugs with a set budget are also alarmed by signs 
that the TPP may grant new negotiating powers 
to the industry. In New Zealand, for example, a 


government agency called Pharmac determines 
whether the benefits of a new drug warrant the 
cost, or if the country is better off sticking with 
a cheaper alternative. A leaked TPP provision 
would empower drug companies to appeal 
such decisions. “We have good processes for 
ensuring what is for the good of our population, 
not for the good of lobby groups, and I don't 
see why they need to interfere with that,” says 
Marilyn Head, a policy analyst at the New 
Zealand Nurses Organisation in Wellington, 
who adds: “Bugger off, quite frankly.” 


Electro-optic dye 
triggers ethics row 


Dispute puts focus on reporting standards for major grants. 


BY EUGENIE SAMUEL REICH 


When a colleague questions a 
researcher’s hypothesis, how far 
must the researcher go in telling 
his prospective funders about those doubts? 

The question sits at the heart of a dispute 
that has prompted a government review of 
alleged omissions in reports from a science and 
technology centre funded by grants totalling 
US$36 million over 10 years from the National 
Science Foundation (NSF). The review, by the 
NSF's inspector general, is not yet complete, 
but the affair highlights a grey area in the 
agency’s rules for grant recipients: although the 
rules require principal investigators to disclose 
any problems they encounter in pursuit of their 
research goals, they offer no guidance on how 
to assess when a colleague's scepticism about a 
specific issue merits reporting. 

The issue became public in late July, when 
Bart Kahr, a chemist at New York University in 
New York city, described his side of the dispute 
at a meeting of the American Crystallographic 
Association in Boston, Massachusetts. But it 
goes back more than a decade, to work led by 
Larry Dalton at the University of Washington in 
Seattle in 2000. Motivated by the rapid expan- 
sion of the Internet, the group was developing 
modulators, colloquially called ‘opto-chips’, 
that convert electrical to optical signals, a more 
efficient medium for long-distance communication. 
Dalton and his team reported¹ record-breaking 
performances by electro-optic devices 
based on dye molecules they had designed. 
And their paper suggested that the key to the 
devices’ performance lay in the way the mol- 
ecules lined up in an electric field. 

The result was discussed in a 2001 grant proposal 
to the NSF, which subsequently funded 
the Center on Materials and Devices for Infor- 
mation Technology Research at the University 
of Washington, with Dalton as its director. 
Research continued on the devices, and Kahr 
joined the centre in 2003. Several groups at the 


centre and elsewhere were continuing to report 
improved performances for the devices, but 
Kahr began to doubt the mechanism that had 
been proposed to explain how they worked. 

Kahr obtained samples of dye molecules 
from another researcher at the centre, Alex 
Jen, and measured their absorption of polar- 
ized light — a way to test their alignment — in 
an electric field. Kahr reported to Jen that his 
results suggested there was no strong align- 
ment and that future efforts to improve the 
devices by optimizing the dye alignment might 
not work unless the mechanism was under- 
stood. But the centre’s annual report to the NSF 
for 2003-04 did not mention Kahr’s findings. 
Jen, who wrote the relevant section, explains 
that he had a wealth of material to include, and 
that there was no effort to omit Kahr’s results 
because they challenged an aspect of the cen- 
tre’s research direction. 

Alarmed at what he regarded as an unethical 
omission, Kahr complained in 2004 to chem- 
ist Alvin Kwiram, then the centre’s executive 
director. Kwiram says that Kahr’s doubts were a 
distraction from the centre's main goal, which 
was to build and improve working devices. 
Although Kahr believed that understanding 
the mechanism was necessary to improve the 
devices as quickly as possible, Kwiram and oth- 
ers felt that they were already being made more 
effective even though the mechanism was in 
dispute. “This issue [of the mechanism] was 
like a mosquito buzzing around and it was like 
don't bite me right now when we've got bigger 
fish to fry,” Kwiram says.

The centre submitted two more annual 
reports without mentioning Kahr’s finding 
that the alignment was weak, and in 2006 the 
centre’s grant came up for a five-year renewal. 
Phil Reid, a chemist at the centre who is now 
its director, says that during a site visit by NSF 
reviewers, Jen mentioned theoretical work 
suggesting that the dye molecules might not 
be aligned as strongly as supposed — work 
also mentioned in the 2005-06 annual report
although not in connection with Kahr and
his concerns. Kahr says that he did not have 
an opportunity to present his data to the NSF 
reviewers, and that he subsequently lost fund- 
ing he had been receiving through the centre. 

Kahr moved to New York University in 2009. 
In 2011, Reid, Jen, Dalton and Bruce Robinson, 
a theoretical chemist at the University of Washington,
published a paper² presenting their
own evidence that some dye molecules similar 
to those used in the original work align only 
weakly in an electric field — findings that par- 
alleled those of Kahr. Robinson sees this simply 
as the resolution of a scientific disagreement, 
not a matter of research ethics. “Bart was right,” 
says Robinson, “but so what?” 

After receiving copies of Kahr’s e-mails 
to centre members raising ethical concerns 
about the omissions, the University of Wash- 
ington’s Office of Scholarly Integrity and Ana 
Mari Cauce, dean of the university’s College 
of Arts and Sciences at the time, conducted 
separate investigations of his allegations in 
2010 and 2011. Both cleared Dalton and Jen 
(the only targets of Kahr’s accusations) of
any violation of ethics. Cauce, who is now the
university’s provost, explained in a letter to Kahr
that Jen’s omission of Kahr’s data from the annual
reports was justified because the data were
preliminary and because there was a scientific
disagreement about whether the molecules were aligned.

But Kahr remained unsatisfied and in Janu- 
ary 2011 submitted allegations to the NSF's 
Office of Inspector General. Susan Carnohan, 
a spokeswoman for the inspector general, told 
Nature that the office does not comment on 
ongoing investigations. 

Jason Borenstein, a philosopher who teaches 
responsible conduct of research to science 
and engineering students at Georgia Institute 
of Technology in Atlanta, believes that grant 
applicants should generally disclose a colleague's 
doubts in their reports to funders. “Typically 
it is preferred, if there is space, to say there is 
another viewpoint that could be presented but 
we believe ours is right for the following reasons,”
he says. “That will make a better case to
the grant reviewers.”
1. Shi, Y. et al. Science 288, 119-122 (2000).
2. Olbricht, B. C. et al. J. Phys. Chem. B 115, 231-241 (2011).
Databases fight funding cuts


Online tools are becoming ever more important to biology, but financial support is unstable. 


BY MONYA BAKER 


In the era of ‘big data’, it is a bitter blow
for scientists to lose access to the online
tools they use to analyse and share terabytes
of information. Yet funding cuts by the
US National Library of Medicine (NLM) are 
threatening five widely used biological data- 
bases, and user communities are now rally- 
ing to save them. “The idea that this resource 
could just disappear is a serious problem for 
everyone who relies on it,” says Mark Musen, 
a bioinformatician at Stanford University in 
California, and manager of Protégé, which 
provides open-source software to organize 
and interrelate biological data. 

Protégé has 200,000 registered users, and the 
NLM, part of the National Institutes of Health 
(NIH) in Bethesda, Maryland, has contrib- 
uted millions of dollars to maintain it. But 
in 2007, the NLM decided that it would stop 
supporting infrastructure grants and would 
redirect resources to informatics research, 
says Valerie Florance, director of extramural 
programmes at the library. Consequently, the 
NLM’s support for Protégé and similar pro- 
jects is not being renewed (see ‘Endangered 
databases’). “It is not a reflection of the value 
of the resources to any of their users,” says Florance.
“It is part of our determination to put
our funds into research and training.”

The argument is playing out at other funding 
agencies, says David Botstein, a genomicist at 
Princeton University in New Jersey, and a member
of the NIH Data and Informatics Working
Group, which published a draft report on the 
issue in June. “The whole system is rigged 
against infrastructure of any kind,” he says, predicting
that “many, many resources” will face
similar funding crises in the near future. 

The Biological Magnetic Resonance Data 
Bank (BioMagResBank, or BMRB), for exam- 
ple, has been funded by the NLM since 1990 and 
holds more than 7,500 entries on biomolecules. 
Structural biologists use the nuclear magnetic 
resonance data to probe questions such as how 
proteins contort as they catalyse reactions. 

More than 90 scientists have written letters 
to Nature Structural and Molecular Biology this 
month in support of the BMRB (J. Markley et al. 
Nature Struct. Molec. Biol. 19, 854-860; 2012). 
Inés Chen, chief editor of the journal, says that 
losing the database would deprive researchers of 
access to crucial data. “As journals, we cannot 
host all the data that are part of the paper, and 
so if they disappear, it’s a big deal.”

ENDANGERED DATABASES
The US National Library of Medicine (NLM) is cutting resources that biologists say are vital to their research.

Resource | NLM-funded since | Function | Usage | Last NLM award
Protégé | [illegible] | Creating tools to organize and analyse data | 200,000 registered users | $956,625
BioMagResBank | 1990 | Holds spectroscopy data for biomolecules | 500-1,000 unique users per day | $727,129
Repbase | 1994 | Identifying families of non-coding DNA across species | 8,000 registered users | $551,544
REBASE | 1995 | Finding where enzymes bind to and cut DNA | 495,844 website hits per month | $235,911
CASP | 2001 | Testing techniques to predict protein structure | More than 100 research groups participate | $515,168

John Markley, director of the BMRB and
a structural biologist at the University of
Wisconsin-Madison, hopes to attract other
federal funders to support the database.
Another option is to charge users, but Musen 
calls that “absurd”, arguing that it would dis- 
courage scientists from accessing sites and, in 
the case of Protégé, from contributing the code 
and plug-ins that make it a useful resource. 
Musen wants to win funding from the NIH to 
keep Protégé going as a key component of new 
research projects. In June 2011, he submitted 
a grant application with more than 100 letters 
of support from scientists; reviewers acknowl- 
edged the letters but said that they had nothing 
to do with the grant’s specific research goals, and
turned it down. Musen resubmitted the applica- 
tion, and should learn the results this month. 
Other databases are putting their trust in 
commercial sponsors. REBASE, which holds 
data on where enzymes bind to and cut DNA, 
is partially supported by laboratory-reagent 
company New England Biolabs of Ipswich, 
Massachusetts. When federal money runs 
out in 2014, the company will take on the full 
costs, says Richard Roberts, chief scientific 
officer of New England Biolabs and founder 
of REBASE. But he acknowledges that this 
potentially leaves the database at the mercy of
shifting commercial priorities.

The least vulnerable databases are those 
directly run by government agencies, says 
Francis Ouellette, a bioinformatician at the 
Ontario Institute for Cancer Research in 
Toronto, Canada. Investigator-driven data- 
bases face more challenges because “they don’t 
fit the research-based standard model” used 
to dispense grants. Cutting funding for poorly 
performing or obsolete databases is sensible, 
says Ouellette, but choking established sites 
that have significant user communities is 
“really short-sighted. If it’s a good database it 
should be maintained.” 

Florance argues that the NLM should back 
innovation, which is difficult when its funds are 
tied up in infrastructure. “I don’t think anyone 
would say that because they got a grant and built 
a database, they should get money forever.” 

One solution, says Musen, could be to wean 
successful projects off investigator-initiated 
grants and move them into the NIH’s longer- 
term intramural programmes. But Botstein 
thinks that would require a philosophical 
change at the agency. “What's really required 
is an understanding of the larger problem,” he 
says. “This is a big thing, and it will be a big 
thing for years to come.”
Voyager’s long goodbye


NASA probes find surprises at the edge of the Solar System. 


BY RON COWEN 


Are we there yet? Ed Stone, the project
scientist for NASA’s two Voyager spacecraft,
wants to know. Since their launch
in 1977, the probes have ventured billions of 
kilometres beyond the outer planets. Now, 
Stone and his colleagues are looking for signs 
that Voyager 1 may finally be nearing the edge 
of the Solar System — where the heliosphere, 
the bubble of electrically charged particles 
blown outwards by the Sun, gives way to interstellar
space (see ‘Edging into the unknown’).
Detecting and characterizing this thresh- 
old — called the heliopause — would be the 
ultimate bonus for a probe that logged its 35th 
year in space on 5 September. When Voyager 1 
set out, says Stone, a physicist at the Califor- 
nia Institute of Technology in Pasadena, who 
has coordinated the mission since the probes 
launched, “the space age was only 20 years old 
and there was no evidence that any spacecraft 
could travel this long and this far from the Sun”. 
The extraordinarily long-lived Voyager 1
began detecting hints of a boundary region
eight years ago. But exiting the Solar System is 
proving to be a longer and more complicated 
affair than Stone and his colleagues had antici- 
pated. By the time Voyager 1 is well and truly 
out, it may have transformed researchers’ ideas 
about the Solar System's invisible edge. 

In the latest twist in the story, the craft seems 
to be traversing an unexpected ‘dead zone’. This
week, Robert Decker, a space scientist at the 
Johns Hopkins University Applied Physics 
Laboratory in Laurel, Maryland, and his colleagues
report¹ in Nature that at Voyager 1’s
current location, some 121.6 astronomical 
units (18.2 billion kilometres) from the Sun, 
the average velocity of solar particles has 
dropped to nearly zero. (Voyager 2, which is 
about 3 billion kilometres closer to the Sun 
and moving in a different direction, has yet to
detect the same reduction in velocity.)

Decker’s team first reported² the change
last year, when it had measurements of the
particles’ velocity only in the radial direction, 
outwards from the Sun. At the time, the team 
thought that the change was a sign that the 
craft was nearing the heliopause, where solar 
particles are expected to collide with powerful 
winds generated by supernovae that exploded 
some 5 million to 10 million years ago. The 
collision would force the solar particles to stop 
moving outwards and push them sideways, 
like a stream of water hitting a solid surface. 
To test the idea, engineers commanded Voyager 1
to roll on its side seven times, so that
its instruments could record particle veloci- 
ties along a line perpendicular to its course. 
Given that sending a command to Voyager 1 
now takes 17 hours, and that the spacecraft’s 
transmitter runs at 23 watts — about as power- 
ful as a refrigerator light bulb — such commu- 
nication is a feat in itself. The researchers were 
astonished to find that the particles had zero 
velocity in this polar direction, too — indicat- 
ing that they were almost stationary rather 
than being buffeted by stellar winds. That
cannot happen at the heliopause, says Decker.
“We therefore conclude ... that Voyager 1 is
not at the present time close to the heliopause,
at least in the form that it has been envisioned,”
the team writes¹.

[Figure: Voyager 1 was launched in 1977. Four of its original instruments (labelled in yellow) are still returning data on conditions at the edge of the Solar System.]
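
As a rough sanity check on the distance and delay figures quoted above, the following Python sketch converts 121.6 astronomical units to kilometres and estimates the one-way travel time of a radio signal moving at the speed of light. The constants are standard reference values; the function names are illustrative assumptions and are not taken from the Voyager team.

```python
# Rough check of the Voyager 1 distance and signal-delay figures.
# Function names here are hypothetical, chosen only for illustration.

AU_KM = 149_597_870.7            # kilometres in one astronomical unit
LIGHT_SPEED_KM_S = 299_792.458   # speed of light in kilometres per second


def au_to_km(distance_au: float) -> float:
    """Convert a distance in astronomical units to kilometres."""
    return distance_au * AU_KM


def one_way_light_time_hours(distance_au: float) -> float:
    """Hours a radio signal needs to cover the given distance once."""
    seconds = au_to_km(distance_au) / LIGHT_SPEED_KM_S
    return seconds / 3600.0


if __name__ == "__main__":
    distance_au = 121.6  # Voyager 1's distance from the Sun, as quoted
    print(f"Distance: {au_to_km(distance_au) / 1e9:.2f} billion km")
    print(f"One-way signal time: {one_way_light_time_hours(distance_au):.1f} hours")
```

Under these constants the distance comes out at roughly 18.2 billion kilometres and the one-way delay at just under 17 hours, in line with the figures in the text. The check ignores the difference between the craft's distance from the Sun and its distance from Earth, which is small at this range.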

Decker and his colleagues now think that 
since 2010, when the craft first recorded a 
velocity drop, it has been in an antechamber 
to the heliopause, at least 1 billion kilometres 
thick. Why the particles are becalmed remains 
a mystery, says Stamatios Krimigis, a space sci- 
entist at Johns Hopkins and a co-author of the 
paper. This leaves theorists in a bind. “There 
no longer exists any guidance on what consti- 
tutes getting out of the Solar System and into 
the Galaxy,” says Krimigis.

Gary Zank, a theoretical physicist at the 
University of Alabama in Huntsville, disa- 
grees. “I don't regard the paper as forcing us to 
revise our models,” he says. His team and others
theorize³ that a magnetic wall in the outer
heliosphere, caused by a pile-up of magnetic 
field lines, could slow down the flow of charged 
particles and account for the near-zero veloci- 
ties recorded by Voyager 1. 

Although the craft has not yet made it to the 
heliopause, the boundary may be within reach. 
This May, Voyager 1 recorded unprecedented 
bursts of cosmic rays — highly energized pro- 
tons and atomic nuclei — coming from outside 
the Solar System. The spikes returned in July, 
this time along with a drop in the incidence of 
lower-energy cosmic rays thought to be accel- 
erated in the Solar System. The changes sug- 
gest that Voyager 1 is nearing the fringe of the 
Solar System, and could cross the heliopause 
by the end of the year, says Krimigis. But, he 
adds, “nature seems to be much more imagi- 
native than we are, so I could be quite wrong”. 

Indeed, David McComas, a physicist at the
Southwest Research Institute in San Antonio,
Texas, and Nathan Schwadron, a plasma phys- 
icist at the University of New Hampshire in 
Durham, suggest an alternative explanation. In 
an article in press in The Astrophysical Journal, 
they propose that Voyager 1 is in a region where 
magnetic field lines running through the outer 
heliosphere link up with the magnetic field of 
the rest of the Galaxy. Here the field would cre- 
ate a conduit for galactic cosmic rays, causing 
the spikes in detection. Cosmic rays accelerated 
within the heliosphere would tend to move 
along other field lines and be less likely to get 
to Voyager. If this model is correct, say McCo- 
mas and Schwadron, the heliopause may still 
be years away.

When Voyager 1 does leave the Solar System, it may meet
further surprises. Researchers have long assumed that a
bow shock lies outside the heliopause.
Similar to the shock wave around a supersonic 
aircraft, the bow shock is thought to form as 
the Solar System ploughs through the inter- 
stellar medium, forcing the local ionized gas 
to change density abruptly and discontinu- 
ously. But in May, McComas and his colleagues 
reported⁴ that data from NASA's Interstellar
Boundary Explorer (IBEX) mission cast doubt 
on this picture. From Earth orbit, IBEX probes 
the interstellar medium by detecting electri- 
cally neutral atoms that slip into the Solar Sys- 
tem through the heliopause. Its measurements 
suggest that the Sun and planets are moving 
through the interstellar medium about 12% 
slower than previously calculated — too slow 
to generate a bow shock. 


None of this uncertainty bothers Stone, who 
expects both Voyagers to cross the heliopause 
well before 2025, when the craft are due to go 
silent as the plutonium isotopes that supply 
their power run out. On the contrary, Stone 
adds, he is pleased that the one-way journey 
has taken so many unexpected turns. “One 
thing Voyager has taught us is to be prepared 
to be surprised.”


1. Decker, R. B., Krimigis, S. M., Roelof, E. C. & Hill, M. E. Nature 489, 124-127 (2012).
2. Krimigis, S. M., Roelof, E. C., Decker, R. B. & Hill, M. E. Nature 474, 359-361 (2011).
3. Zank, G. P. Space Sci. Rev. 89, 413-688 (1999).
4. McComas, D. J. et al. Science 336, 1291-1293 (2012).


The News Feature ‘Making the links’ (Nature 
488, 448-450; 2012) misspelt David 
Lazer’s name and wrongly located him. He 
is at Northeastern University in Boston. 


The News Feature ‘Man of the desert’
(Nature 488, 272-274; 2012) got the
details of Kröpelin’s 2005 trip wrong.
The heavy gunfire heard by the team was
caused by Darfur rebels killing 20 Sudanese
soldiers (not the other way round).


The News Feature ‘Armed resistance’ (Nature 
488, 576-579; 2012) conflated the Puebla 
campuses of the University of the Americas 
and the Monterrey Institute of Technology 
and Higher Education. The former was home 
to the first nanotechnology lab in Mexico, the 
latter was the first institute in Latin America 
to offer an undergraduate programme in the 
field and had a false bomb alert last August. 
AS HE REVOLUTIONIZES IDEAS ABOUT DINOSAUR EVOLUTION,
XING XU IS HELPING TO MAKE CHINA INTO A PALAEONTOLOGICAL POWERHOUSE.


BY KERRI SMITH 


Palaeontologist Xing Xu bends low over a beautifully preserved
specimen of the ancient bird species Sapeornis, entombed in
a glass museum cabinet in Shandong Province, China. The
bird’s spindly legs stretch as if it were about to stride forward,
even though the creature has been dead for more than 110 million 
years. From its chicken-sized body juts a fine neck, a delicate skull and 
the clear imprint of a long, jaunty tail feather — something never seen 
before in this species. 


Sapeornis is one of hundreds of plumed specimens pouring out 
of fossil beds in China — most notably out of the rock formations 
in Liaoning Province, northeast of Beijing. Some of the Liaoning 
fossils are the earliest known birds. Others are feathered dinosaurs,
the group that spawned birds millions of years
before the age of Sapeornis. Together, they are
among the most important finds in dinosaur
palaeontology in the past century.

[Photo: Xing Xu stands among the remains of duck-billed dinosaurs in Zhucheng, China. Credit: Lou Linwei.]

Xu is at the centre of that bonanza. He is “the 
go-to man in China for anything people want 
to know about dinosaurs”, says Paul Barrett,
who studies dinosaurs at the Natural History 
Museum in London and first met Xu in the 
1990s, when both were graduate students. 
Xu, who is based at the Institute of Vertebrate 
Paleontology and Paleoanthropology (IVPP) 
in Beijing, has named 60 species so far — more 
than any other vertebrate palaeontologist alive 
today. And he is only 43 years old. 

In describing the flock of feathered fossils, 
Xu has helped to show that birds arose from 
dinosaurs, ending decades of debate. Along the 
way, he has shed light on the origins of feath- 
ers and flight. And he has bucked 150 years of 
received wisdom by declaring that the fabled 
genus Archaeopteryx is not the oldest known 
bird, but rather belonged to a group of dinosaurs
removed from the avian line¹. “He has
patience and persistence — and an audacity 
when scientific evidence calls for it,” says Zhe-Xi
Luo, who studies fossil mammals at the University
of Chicago in Illinois.

Even as he unveils new species at a break- 
neck pace, Xu is concerned about the future 
of palaeontology in China and the commer- 
cialization of fossils. Many of the feathered fos- 
sils from Liaoning are dug up by local farmers 
tending their fields, who try to sell them to the 
highest bidder. This fossil ‘grey market’ — it 
is technically illegal to sell fossils in China, 
but the practice continues openly 
— encourages fakery and causes 
specimens to disappear into private 
collections. By cultivating a vast net- 
work of contacts at important fossil 
sites in Liaoning and elsewhere, Xu 
has laboured to ensure that scien- 
tists gain access to the best speci- 
mens. It’s a job that requires hard 
work and luck, he says. “When I started my 
career, I never expected that I would have so 
many discoveries.” 


DINO DISNEY 

Nobody knows what happened about 80 mil- 
lion years ago near what is now the town of 
Zhucheng in Shandong Province, but it must 
have been disastrous. On the outskirts of the 
city, about an hour’s flight south of Beijing, 
hundreds of bones litter a 300-metre stretch 
of hillside. Palaeontologists have been finding 
dinosaurs near Zhucheng for decades, but in 
2008 local farmers unearthed a large commu- 
nity of duck-billed dinosaurs and others that 
had apparently died en masse. 

Xu was called in to investigate and he is now 
studying a possible new species of ceratopsian 
— herbivorous beaked dinosaurs — recov- 
ered from the fossil bed. He is also acting as 
scientific consultant to local administrators,
who want to build a dinosaur theme park in
Zhucheng. During a visit to the site in June, Xu 
had hoped to do research, but he ended up cor- 
recting display captions and reading through 
proposals for the park. “In terms of scale it may 
be comparable to Disneyland,” says Xu, a hint 
of trepidation in his voice. 

Fossils are a thriving business as well as a science
in China, and palaeontologists often have
to negotiate with local prospectors and direc- 
tors of museums and tourism bureaux to gain 
access to fossil sites and specimens. Despite 
Xu’s boyish appearance, he is a dexterous dip- 
lomat and has managed to arrange for the most 
scientifically interesting specimens to cross his 
desk, wherever they are found. 

Thanks to those arrangements, Xu has had a
bounty of fossils to work on, particularly from 
Liaoning. The creatures unearthed there are 
remarkably well preserved, perhaps because 
they were entombed quickly during volcanic 
eruptions and mudslides between 160 million 
and 120 million years ago. The rocks record fine 
details including the imprints of feathers, which 
allowed Xu to determine² that a fierce 9-metre-long
tyrannosaurid, which he named Yutyrannus,
had a coat of long feathers (see ‘Xing Xu’s
feathered friends’). One of Xu’s favourite Liaoning
fossils, Microraptor, is one of the smallest
known dinosaurs not on the avian line. From 
the imprint of feathers, Xu and his colleagues 
concluded³ that Microraptor had four wings —
one on each arm and leg — and could probably 
glide. From other Liaoning specimens, he has 
established⁴ that some feathered dinosaurs slept
curled up, just like birds. 


“MY EXCITEMENT IS PROPORTIONAL TO 
THE INFORMATION YOU GET. AND THOSE 
WERE REALLY EXCITING FOSSILS.” 


When he can find the time, Xu does fieldwork 
of his own (see ‘Dinosaur hunting grounds’). He 
led teams to three sites this summer. Near the 
northern Chinese town of Lingwu, the excava- 
tions turned up a new sauropod — a dinosaur 
from the same group as Diplodocus. In the 
autonomous region of Inner Mongolia, the Xu 
group found a new type of bird and what may 
be a previously unknown theropod (the dinosaur
lineage that led to birds). At another northern
site, he uncovered a collection of beaked
dinosaurs. 

To power this dinosaur-discovery factory, 
Xu runs a lab of 14 people, including five students,
seven preparators who carefully separate
the fossils from the [...]
Western palaeontologists. The Natural History
Museum in London, for example, has just two 
full-time preparators for about 20 palaeontol- 
ogy curators and researchers, says Barrett. 

Xu didn't set out to be a palaeontologist; in 
fact, he had no idea what a dinosaur was until 
he entered university. He was born in the poor 
western province of Xinjiang in 1969, a few
years after his parents relocated there as part 
of a Cultural Revolution development initia- 
tive in which educated couples were forced to 
move to rural provinces. 

He excelled in school and in 1988 earned a 
place at Peking University in Beijing, the nation’s 
premier university. Xu wanted to study econom- 
ics, but at the time students had no choice in 
their degrees. For reasons that are unclear to 
him, he was obliged to study palaeontology. 


LATE STARTER 

Xu’s interest in the subject picked up only when 
he reached the third year of a master’s degree 
at the IVPP. He was studying two specimens 
that his adviser, Xijin Zhao, had discovered in 
the 1960s and 1970s and had not found time 
to analyse fully. They turned out to be the ear- 
liest examples of ceratopsians, pushing the 
record of this group back by up to 30 million 
years, from the early Cretaceous period, which 
started 145 million years ago, to the middle or 
late Jurassic period⁵. “My excitement [over a
fossil] is proportional to the information you
get from it,” says Xu. “And those were really
exciting fossils.” 

Xu’s timing was perfect. While he was working 
on his master’s thesis, the trickle of dinosaur 
species turning up in China grew to 
a deluge. Funding for palaeontology 
was increasing; farmers in Liaoning 
started recognizing the value of the 
fossils they sometimes found; and a 
burst of construction meant that new 
fossils were being unearthed more
frequently. As a budding dinosaur
palaeontologist, Xu was well placed 
to study some of those specimens. 

However, fortuitous timing can explain 
only a portion of Xu’s productivity. A large 
part comes from his legendary work ethic. “If 
I want to learn something I put all my time 
into it,’ says Xu. He currently has more than 
20 manuscripts in draft form, including one 
on the Sapeornis specimen from the Shandong 
Tianyu Museum of Nature. He estimates that 
there are eight or nine new species among the 
crop of fossils awaiting publication. 

Even away from his office, any spare moment 
is filled with talk of projects. Outside the Tianyu 
museum, Xu chats to a colleague about Micro- 
raptor and — to make an anatomical point 
— starts drawing a diagram of the creature’s 
feathers in the dust on a nearby car. 

Xu has an international outlook that also 
contributes to his success. From the start of 
his career, he has done what has not come 
naturally to many Chinese palaeontologists 
— building up a fat book of contacts in the
United Kingdom and the United States, and 
publishing much of his work in English in 
international journals. Playing to a tougher 
international audience was “really important 
for my career”, says Xu. Chinese journals, he
adds, don’t require the same level of critique 
and peer review as international publications. 

Luo says that Xu is one of only a few palae- 
ontologists in China to embrace cladistics — a 
process for determining evolutionary relation- 
ships by analysing the features that groups 
share. Western researchers and international 
journals have been using cladistics for more 
than two decades, but it has been slow to catch 
on in China. 

Within his own country, Xu crosses bounda- 
ries between the academic and commercial 
sectors. For example, he has forged a close relationship
with Xiaoting Zheng, the former head
of a local state-owned gold mine who is now a
keen amateur fossil collector, a budding palaeontologist
and director of the Tianyu museum.
In his museum, Zheng has accumulated one of
the largest assemblages of feathered dinosaur
fossils in the world. Over the years, Xu has been
teaching him what to look out for in his purchases
and has analysed some of the acquisitions.
The two make a formidable team.


FEATHERS FLYING

Last year¹, Xu made a big splash with a specimen
from the Tianyu museum’s collection: a
small feathered dinosaur that he named Xiaotingia
zhengi to honour Zheng. The creature
had a shallow snout, a distinctive skull shape
and other features that led Xu and his colleagues
to place it as a close relative of Archaeopteryx.
That animal has long been regarded
as the oldest known bird, but Xu and his
colleagues performed a cladistic analysis that
knocked Archaeopteryx from its special perch
on the bird lineage, relegating it to a different
branch along with a host of other feathered
dinosaurs. That study has met resistance from
some other palaeontologists, who question
the strength of the cladistic analysis and say
that the evolutionary relationships will remain
unclear until more early birds and their close
relatives are discovered.

The Liaoning fossils have led Xu to make
other bold proposals about the origins of flight.
The discoveries of Microraptor and Anchiornis,
another four-winged dinosaur, led Xu to
argue’ that the four-winged trait was not an
evolutionary dead end, as had been previously
assumed, but could actually have been the
transitional step between dinosaurs and birds.

The feathered dinosaur fossils have also provided
some of the first hard evidence for when
and why feathers evolved. “For most of the past
century, the classic issue in feather evolution
was that the fossil record told us essentially
nothing,” says Richard Prum, who studies the
evolution of birds at Yale University in New
Haven, Connecticut. “What’s happened with
the Liaoning formation has been a totally new
chapter.”


XING XU’S FEATHERED FRIENDS

Fossils found in Liaoning in northeastern China show that many dinosaurs in the late Jurassic and early
Cretaceous periods had feathers. The exceptional specimens have transformed ideas about theropod
dinosaurs and the birds that evolved from them.

XIAOTINGIA ZHENGI
Late Jurassic (160 million to 145 million years ago)
The 30-centimetre-long Xiaotingia had feathers
and other features resembling those of
Archaeopteryx, often considered the earliest bird.
Xu has proposed that both belonged to a group
of non-avian dinosaurs, closely related to but
distinct from birds.

ANCHIORNIS HUXLEYI
Late Jurassic (160 million to 145 million years ago)
An exquisitely preserved specimen of the
dinosaur Anchiornis helped Xu and his
colleagues pin down the timing of the transition
from dinosaurs to birds. Its long feathers
demonstrated how complex early plumage
could be.

MICRORAPTOR GUI
Early Cretaceous (145 million to 100 million years ago)
Although not a bird, the tiny dinosaur Microraptor
had feathers on its arms and on its legs,
and it may have flown.

YUTYRANNUS HUALI
Early Cretaceous (145 million to 100 million years ago)
This 9-metre-long predator provided evidence
that even some big dinosaurs had feathers. Three
specimens were found, two juveniles and an adult,
with feathers in various locations, including the
hip, neck and back.

In the past, palaeontologists had presumed 
that when feathers first arose, they helped bird 
ancestors to fly. But on the basis of his discover- 
ies, Xu makes the controversial argument that 
most dinosaurs probably had at least a smatter- 
ing of plumage, which would mean that feath- 
ers originally served other functions, such as 
attracting mates or insulating against the cold. 


ROLE MODEL 

With his extraordinary track record, Xu brings 
to mind the prodigious US palaeontologists 
Othniel Marsh and Edward Cope, who dis- 
covered dozens of dinosaurs in the late nine- 
teenth century in a frenetic competition that 
became known as the bone wars. But whereas 
those Victorian fossil hunters made frequent 
errors, such as giving new names to species 
that had already been described, Xu is a care- 
ful researcher who does not rush into print. 
His published record of new species has rarely 
been challenged, says Mike Benton, a palae- 
ontologist at the University of Bristol, UK, 
who has analysed the accuracy of dinosaur 
researchers. 

Xu would like to see Chinese science as a
whole become more careful. “Chinese culture
is a problem for science because it’s not logical
enough,” says Xu during a trip this summer,
as his driver gaily overtakes on the wrong side
of the road. “Traditionally people
don't like to criticize, either. For peer
review you have to criticize in some
way ...” He breaks off mid-sentence
to answer a phone call in Mandarin
for a few minutes, before resuming
exactly where he left off: “... but here
in China we don't have a real peer-review
system.”

Another problem hanging over Chinese pal- 
aeontology is fakery. Xu is keenly aware of it. In 
2000, he helped to unmask one of the biggest
hoaxes in a generation: a composite specimen 
named Archaeoraptor, made up of the upper 
body of an ancient bird and the tail of the 
dinosaur Microraptor. Scientists are getting 
better at spotting fakes, says Xu, but they do 
still crop up, because poor farmers know that 
they can sell the most unusual fossils to muse- 
ums or institutes for hefty sums. “We have the 
greatest resources in palaeontology now,” says
Zhonghe Zhou, director of the IVPP, “but on 
the other hand, the destruction of localities, 
the faking — those kinds of things are often 
the most severe. The law isn’t good enough.”

Xu worries about the future of his profession,
particularly the next generation of scientists.
His current students aren't showing the
dedication that their boss would like. “They
don't work as hard as me,” he says. “Maybe I ask
too much, maybe that’s my problem.” Qing-Jin
Meng, director of the Beijing Museum of
Natural History, says, “Excellent palaeontologists
[such as Xu] are hard to find.” Part of the
problem may be the globalization of Chinese
palaeontology, he adds. “Many students who
have great potential have gone to the United
States and European countries to study.”

DINOSAUR HUNTING GROUNDS
Rich deposits of dinosaur fossils are scattered around China. Xing Xu
has excavated or studied many of the key finds emerging from those sites.

Xu says that if only he could find the time, 
he would like to write articles about how to 
improve Chinese science. But so far he has 
published only one blog post in Mandarin. 


HE IS THE GO-TO MAN IN CHINA FOR ANYTHING PEOPLE WANT TO KNOW ABOUT DINOSAURS.


“Honestly, I don't like it much. I'd rather do
science,” he says. 

Xu’s packed schedule can be hard on his 
family — his wife Zhonghia Zhou, who is a 
secretary at the Institute for Geology and Geo- 
physics in Beijing, and their two boys, aged 7 
and 12. “My wife complains because the kids 
are growing up,” confesses Xu. “She says they
need a male example. And I thought, yeah, 
that’s important.” 

So in the past couple of years he has tried to
spend more time at home, helping with home- 
work, playing table tennis with his wife and 
taking his family on days out to Beijing’s parks. 
Even the director of the IVPP recognizes a 
candidate for burn-out when he sees one: “He 
should slow down a bit!” says Zhou. “You can't 
study everything — you need time for hob- 
bies.” To that end, Xu and Zhou sometimes 
play badminton on the court installed in the
entrance hall of the IVPP. 

It is unlikely that Xu’s hobbies will eat into 
his prodigious output too much. Back in his 
office in Beijing after the trip to Zhucheng, Xu 
rummages through the floor-to-ceiling cup- 
boards lining two walls. He pulls out slabs of 
rock, pointing out salient features and clues 
that he might have an unknown species on his 
hands. 

More than setting records by finding 
new creatures, Xu is interested in asking 
and answering questions about 
a far-gone era, when his country 
was filled with a dizzying array of 
feathered dinosaurs and birds. He 
is keen, for example, to continue 
exploring how non-avian dinosaurs 
developed feathers and whether the 
plumage differed from that of mod- 
ern birds. As he looks over the fos- 
sils in his office, Xu’s eyes glint with a blend of 
tiredness and excitement. 

Luo, who has watched Xu’s career take off, 
sees no end to the potential discoveries. “Fossils
are silent,” says Luo. “It takes an insightful
palaeontologist to tell their story, and Xu Xing
is a fantastic storyteller.”


Kerri Smith is podcast editor for Nature in 
London. 

Ewan Birney would like to create a printout of all the
genomic data that he and his collaborators have been
collecting for the past five years as part of ENCODE, the
Encyclopedia of DNA Elements. Finding a place to put
it would be a challenge, however. Even if it contained
1,000 base pairs per square centimetre, the printout
would stretch 16 metres high and at least 30 kilometres long.

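
As a rough check on the scale Birney describes, the printout dimensions can be turned back into a base-pair count. The sketch below assumes the quoted density of 1,000 base pairs per square centimetre and treats the 30-kilometre length as a lower bound; the variable names are illustrative, and the comparison with a genome of about 3 billion letters is a back-of-envelope figure rather than an ENCODE statistic.

```python
# Back-of-envelope check of the printout described above.
BASES_PER_CM2 = 1_000          # density quoted in the text
HEIGHT_CM = 16 * 100           # 16 metres, in centimetres
LENGTH_CM = 30 * 1_000 * 100   # 30 kilometres (the text says "at least")

area_cm2 = HEIGHT_CM * LENGTH_CM
total_bases = area_cm2 * BASES_PER_CM2

GENOME_BASES = 3_000_000_000   # roughly 3 billion letters in the human genome
genome_equivalents = total_bases / GENOME_BASES

print(f"Printout area: {area_cm2:,} cm^2")
print(f"Bases on the printout: at least {total_bases:,}")
print(f"Equivalent to about {genome_equivalents:,.0f} human genomes")
```

On these assumptions the printout would hold at least 4.8 trillion base pairs, about 1,600 times the length of the genome itself.
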
ENCODE was designed to pick up where the Human Genome 
Project left off. Although that massive effort revealed the blue- 
print of human biology, it quickly became clear that the instruc- 
tion manual for reading the blueprint was sketchy at best. 
Researchers could identify in its 3 billion letters many of the 
regions that code for proteins, but those make up little more than 
1% of the genome, contained in around 20,000 genes — a few 
familiar objects in an otherwise stark and unrecognizable land- 
scape. Many biologists suspected that the information respon- 
sible for the wondrous complexity of humans lay somewhere in 
the ‘deserts’ between the genes. ENCODE, which started in 2003, 
is a massive data-collection effort designed to populate this ter- 
rain. The aim is to catalogue the ‘functional’ DNA sequences that 
lurk there, learn when and in which cells they are active and trace 
their effects on how the genome is packaged, regulated and read. 
After an initial pilot phase, ENCODE scientists started apply- 
ing their methods to the entire genome in 2007. Now that phase 
has come to a close, signalled by the publication of 30 papers, 
in Nature, Genome Research and Genome Biology. The consor- 
tium has assigned some sort of function to roughly 80% of the 
genome, including more than 70,000 ‘promoter’ regions (the
sites, just upstream of genes, where proteins bind to control
gene expression) and nearly 400,000 ‘enhancer’ regions that
regulate expression of distant genes (see page 57). But the job is far
from done, says Birney, a computational biologist at the European
Molecular Biology Laboratory’s European Bioinformatics
Institute in Hinxton, UK, who coordinated the data analysis for
ENCODE. He says that some of the mapping efforts are about
halfway to completion, and that deeper characterization of everything
the genome is doing is probably only 10% finished. A third
phase, now getting under way, will fill out the human instruction
manual and provide much more detail.

Many who have dipped a cup into the vast stream of data
are excited by the prospect. ENCODE has already illuminated
some of the genome’s dark corners, creating opportunities to
understand how genetic variations affect human traits and diseases.
Exploring the myriad regulatory elements revealed by the
project and comparing their sequences with those from other
mammals promises to reshape scientists’ understanding of how
humans evolved.

Yet some researchers wonder at what point enough will be
enough. “I don't see the runaway train stopping soon,” says Chris
Ponting, a computational biologist at the University of Oxford,
UK. Although Ponting is supportive of the project’s goals, he
does question whether some aspects of ENCODE will provide
a return on the investment, which is estimated to have exceeded
US$185 million. But Job Dekker, an ENCODE group leader at
the University of Massachusetts Medical School in Worcester,
says that realizing ENCODE’s potential will require some
patience. “It sometimes takes you a long time to know how much
can you learn from any given data set,” he says.

Even before the human genome sequence was finished, the
National Human Genome Research Institute (NHGRI), the
main US funder of genomic science, was arguing for a systematic approach
to identify functional pieces of DNA. In 2003, it invited biologists to propose
pilot projects that would accrue such information on just 1% of the genome,
and help to determine which experimental techniques were likely to work
best on the whole thing.

ENCODE: Encyclopedia of DNA Elements

Scientists in the Encyclopedia of DNA Elements Consortium have applied 24 experiment
types (across) to more than 150 cell lines (down) to assign functions to as many DNA
regions as possible, but the project is still far from complete.

Every shaded box represents at least one genome-wide experiment run on a cell type.

CELL LINES
Tiers 1 and 2: widely used cell lines that were given priority.
Tier 3: all other cell types.

[...] groups, which regulate gene expression.
Open chromatin: areas in which the DNA and proteins that make up chromatin are accessible to regulatory proteins.
RNA binding: positions where regulatory proteins attach to RNA.
RNA sequences: regions that are transcribed into RNA.
ChIP-seq: technique that reveals where proteins bind to DNA.
Modified histones: histone proteins, which package DNA into chromosomes, modified by chemical marks.
Transcription factors: proteins that bind to DNA and regulate transcription.

The pilot projects transformed biologists’ view of the genome. Even 
though only a small amount of DNA manufactures protein-coding mes- 
senger RNA, for example, the researchers found that much of the genome 
is ‘transcribed’ into non-coding RNA molecules, some of which are now 
known to be important regulators of gene expression. And although 
many geneticists had thought that the functional elements would be 
those that are most conserved across species, they actually found that 
many important regulatory sequences have evolved rapidly. The consortium
published its results in 2007, shortly after the NHGRI had issued
a second round of requests, this time asking would-be participants to 
extend their work to the entire genome. This ‘scale-up’ phase started just 
as next-generation sequencing machines were taking off, making data 
acquisition much faster and cheaper. “We produced, I think, five times the 
data we said we were going to produce without any change in cost,” says
John Stamatoyannopoulos, an ENCODE group leader at the University 
of Washington in Seattle. 

The 32 groups, including more than 440 scientists, focused on 
24 standard types of experiment (see ‘Making a genome manual’). 
They isolated and sequenced the RNA transcribed from the genome, 
and identified the DNA binding sites for about 120 transcription fac- 
tors. They mapped the regions of the genome that were carpeted by 
methyl chemical groups, which generally indicate areas in which genes 
are silent. They examined patterns of chemical modifications made to 
histone proteins, which help to package DNA into chromosomes and 
can signal regions where gene expression is boosted or suppressed. And 
even though the genome is the same in most human cells, how it is used 
is not. So the teams did these experiments on multiple cell types — at 
least 147 — resulting in the 1,648 experiments that ENCODE reports 
on this week.

Figure annotation (fragment): examined 13 of about 60 known histone modifications and [...] transcription factors.

Stamatoyannopoulos and his collaborators, for example, mapped the
regulatory regions in 125 cell types using an enzyme called DNase I (see
page 75). The enzyme has little effect on the DNA that hugs histones,
but it chops up DNA that is bound to other regulatory proteins, such as
transcription factors. Sequencing the chopped-up DNA suggests where
these proteins bind in the different cell types. The team discovered around
2.9 million of these sites altogether. Roughly one-third were found in only
one cell type and just 3,700 showed up in all cell types, suggesting major
differences in how the genome is regulated from cell to cell.
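
To put those proportions in perspective, the short sketch below works through the figures quoted in the paragraph above. It uses only the rounded numbers in the text (around 2.9 million sites, roughly one-third, 3,700), so the outputs are approximations; the variable names are illustrative.

```python
# Approximate breakdown of the DNase I-sensitive sites quoted in the text.
TOTAL_SITES = 2_900_000            # "around 2.9 million" sites altogether
SITES_IN_ALL_CELL_TYPES = 3_700    # sites that showed up in every cell type

single_type_sites = TOTAL_SITES / 3                      # "roughly one-third"
shared_fraction = SITES_IN_ALL_CELL_TYPES / TOTAL_SITES  # share common to all

print(f"Sites seen in only one cell type: about {single_type_sites:,.0f}")
print(f"Sites shared by all cell types: {shared_fraction:.2%} of the total")
```

On these rounded figures, close to a million sites are specific to a single cell type, whereas only about one site in 800 is common to all of them.
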

The real fun starts when the various data sets are layered together. 
Experiments looking at histone modifications, for example, reveal patterns
that correspond with the borders of the DNase I-sensitive sites.
Then researchers can add data showing exactly which transcription 
factors bind where, and when. The vast desert regions have now been 
populated with hundreds of thousands of features that contribute to 
gene regulation. And every cell type uses different combinations and 
permutations of these features to generate its unique biology. This 
richness helps to explain how relatively few protein-coding genes can 
provide the biological complexity necessary to grow and run a human 
being. ENCODE “is much more than the sum of the parts”, says Manolis
Kellis, a computational genomicist at the Massachusetts Institute of 
Technology in Cambridge, who led some of the data-analysis efforts. 

The data, which have been released throughout the project, are 
already helping researchers to make sense of disease genetics. Since 
2005, genome-wide association studies (GWAS) have spat out thou- 
sands of points on the genome in which a single-letter difference, or 
variant, seems to be associated with disease risk. But almost 90% of these 
variants fall outside protein-coding genes, so researchers have little clue 
as to how they might cause or influence disease. 

The map created by ENCODE reveals that many of the disease-linked 
regions include enhancers or other functional sequences. And cell
type is important. Kellis’s group looked at some of the variants that 
are strongly associated with systemic lupus erythematosus, a disease 
in which the immune system attacks the body's own tissues. The team 
noticed that the variants identified in GWAS tended to be in regulatory 
regions of the genome that were active in an immune-cell line, but not 
necessarily in other types of cell. Kellis’s postdoc Lucas Ward has
created a web portal called HaploReg, which allows researchers to screen 
variants identified in GWAS against ENCODE data in a systematic way. 
“We are now, thanks to ENCODE, able to attack much more complex 
diseases,” Kellis says. 


ARE WE THERE YET? 

Researchers could spend years just working with ENCODE's existing 
data, but there is still much more to come. On its website, the University
of California, Santa Cruz, has a telling visual representation of
ENCODE’s progress: a grid showing which of the 24 experiment types
have been done and which of the nearly 180 cell types ENCODE has now
examined. It is sparsely populated. A handful of cell lines, including the
lab workhorses called HeLa and GM12878, are fairly well filled out. Many,
however, have seen just one experiment.

Scientists will fill in many of the blanks as 
part of the third phase, which Birney refers 
to as the ‘build out’. But they also plan to add more experiments and cell
types. One way to do that is to expand the use of a technique known as 
chromatin immunoprecipitation (ChIP), which looks for all sequences 
bound to a specific protein, including transcription factors and modified 
histones. Through a painstaking process, researchers develop antibod- 
ies for these DNA binding proteins one by one, use those antibodies to 
pull the protein and any attached DNA out of cell extracts, and then 
sequence that DNA. 

But at least that is a bounded problem, says Birney, because there are 
thought to be only about 2,000 such proteins to explore. (ENCODE has 
already sampled about one-tenth of these.) More difficult is figuring out 
how many cell lines to interrogate. Most of the experiments so far have 
been performed on lines that grow readily in culture but have unnatu- 
ral properties. The cell line GM12878, for example, was created from 
blood cells using a virus that drives the cells to reproduce, and histones 
or other factors may bind abnormally to its amped-up genome. HeLa 
was established from a cervical-cancer biopsy more than 50 years ago 
and is riddled with genomic rearrangements. Birney recently quipped 
at a talk that it qualifies as a new species. 

ENCODE researchers now want to look at cells taken directly from a 
person. But because many of these cells do not divide in culture, experi- 
ments have to be performed on only a small amount of DNA, and some 
tissues, such as those in the brain, are difficult to sample. ENCODE 
collaborators are also starting to talk about delving deeper into how 
variation between people affects the activity of regulatory elements in 
the genome. “At some places there's going to be some sequence variation 
that means a transcription factor is not going to bind here the same way 
it binds over here,” says Mark Gerstein, a computational biologist at Yale 
University in New Haven, Connecticut, who helped to design the data 
architecture for ENCODE. Eventually, researchers could end up looking 
at samples from dozens to hundreds of people. 

The range of experiments is expanding, too. One quickly develop- 
ing area of study involves looking at interactions between parts of the 
genome in three-dimensional space. If the intervening DNA loops out 
of the way, enhancer elements can regulate genes hundreds of thousands 
of base pairs away, so proteins bound to the enhancer can end up inter- 
acting with those attached near the gene. Dekker and his collaborators 
have been developing a technique to map these interactions. First, they 
use chemicals that fuse DNA-binding proteins together. Then they cut 
out the intervening loops and sequence the bound DNA, revealing the 
distant relationships between regulatory elements. They are now scaling
up these efforts to explore the interactions across the genome. “This is
beyond the simple annotation of the genome. It’s the next phase,” Dekker
says.

The question is, where to stop? Kellis says that some experimental 
approaches could hit saturation points: if the rate of discoveries falls 
below a certain threshold, the return on each experiment could become 
too low to pursue. And, says Kellis, scientists could eventually accumu- 
late enough data to predict the function of unexplored sequences. This 
process, called imputation, has long been a goal for genome annotation. 
“I think there's going to be a phase transition where sometimes imputation
is going to be more powerful and more accurate than actually doing
the experiments,” Kellis says. 

Yet with thousands of cell types to test and a growing set of tools with 
which to test them, the project could unfold endlessly. “We're far from 
finished,” says geneticist Rick Myers of the HudsonAlpha Institute for 
Biotechnology in Huntsville, Alabama. “You might argue that this could
go on forever.” And that worries some people. The pilot ENCODE project
cost an estimated $55 million; the scale-up was about
$130 million; and the NHGRI could award
up to $123 million in the next phase.

Some researchers argue that they have 
yet to see a solid return on that investment. 
For one thing, it has been difficult to collect 
detailed information on how the ENCODE 
data are being used. Mike Pazin, a programme director at the NHGRI, 
has scoured the literature for papers in which ENCODE data played a 
significant part. He has counted about 300, 110 of which come from 
labs without ENCODE funding. The exercise was complicated, however, 
because the word ‘encode’ shows up in genetics and genomics papers 
all the time. “Note to self,” says Pazin wryly, “make up a unique project
name next time around.”

A few scientists contacted for this story complain that this isn’t much 
to show from nearly a decade of work, and that the choices of cell lines 
and transcription factors have been somewhat arbitrary. Some also 
think that the money eaten up by the project would be better spent on 
investigator-initiated, hypothesis-driven projects — a complaint that 
also arose during the Human Genome Project. But unlike the genome 
project, which had a clear endpoint, critics say that ENCODE could 
continue to expand and is essentially unfinishable. (None of the scien- 
tists would comment on the record, however, for fear that it would affect 
their funding or that of their postdocs and graduate students.) 

Birney sympathizes with the concern that hypothesis-led research 
needs more funding, but says that “it’s the wrong approach to put these 
things up as direct competition”. The NHGRI devotes a lot of its research
dollars to big, consortium-led projects such as ENCODE, but it gets just 
2% of the total US National Institutes of Health budget, leaving plenty 
for hypothesis-led work. And Birney argues that the project's systematic 
approach will pay dividends. “As mundane as these cataloguing efforts 
are, you've got to put all the parts down on the table before putting it 
together,” he says.

After all, says Gerstein, it took more than half a century to get from 
the realization that DNA is the hereditary material of life to the sequence 
of the human genome. “You could almost imagine that the scientific 
programme for the next century is really understanding that sequence.”


Brendan Maher is a features editor for Nature. 

The town of Times Beach in Missouri was evacuated in 1983 and later demolished after a dioxin spill.


Rethink chemical 
risk assessments 


The US Environmental Protection Agency needs to 
speed up its risk analyses and address uncertainty, 
say George M. Gray and Joshua T. Cohen. 


The US Environmental Protection
Agency (EPA) is under fire. Its flag- 
ship Integrated Risk Information 
System (IRIS), which develops risk values for 
human chemical exposure that are used by 
regulators and others, is being widely criti- 
cized for being too slow and scientifically 
flawed. The system needs an overhaul. 


Last year, for instance, the US National
Academy of Sciences (NAS) castigated the
EPA’s inadequate assessment of the health
risks of formaldehyde. Evaluations of other
chemicals, including dioxin, have been
equally controversial. In December 2011,
Congress directed the agency to improve its
risk assessments and submit documentation
to the NAS for review (see go.nature.com/xmeqyv).
But the problems go deeper than
the IRIS process.

Two main challenges render the EPA’s risk
assessments inadequate for decision-making. 
First, they take years or even decades to 
conclude, meaning that many chemicals 
have never been examined. Second, their 
scientific credibility is often challenged. 
Peer reviewers have questioned the EPA's 
selective use of data and some assumptions 
that it has made to plug gaps in the scientific 
evidence. The NAS has recommended that 
the EPA better justify and quantify its risk- 
assessment assumptions. 

As scientists who have served at the EPA 
(G.M.G.) and participated in NAS reviews 
(J.T.C.), we believe that more is needed. 
The agency needs to fundamentally alter its 
approach to risk evaluation. First, it should 
offer faster summaries for more chemicals. 
Rough-and-ready estimates are often suffi- 
cient for policy-making, and are better than 
nothing. IRIS should include information 
from private groups and other governments, 
and apply available techniques for calculating 
the risks of chemicals for which there are few 
data. Second, the EPA needs to acknowledge 
that its risk estimates are uncertain by report- 
ing a range of plausible values, not just those 
that support its science-policy goals. 


ROOTED IN THE PAST 
Attitudes towards environmental regulation 
have changed since the agency was founded in 
1970. Less than a decade after Rachel Carson 
exposed the environmental damage caused 
by the pesticide DDT in her 1962 book Silent 
Spring, Americans wanting “freedom from 
risk” embraced government protection.
The EPA successfully addressed health 
threats posed by high-profile pollutants. A 
ban on leaded petrol spearheaded by the EPA 
in 1973 helped to reduce the level of lead in 
children’s blood by nearly an order of mag- 
nitude in the decades that followed. Other 
agency regulations introduced in the early 
1970s halved the levels of air pollutants such 
as sulphur dioxide and carbon monoxide. 
By the mid-1990s, the most glaring envi- 
ronmental problems had been dispatched 
and the EPA’s progress stalled. Although IRIS
now counts 557 finished risk assessments in 
its repository, releases in each year since 
1995 have mostly been in single digits 
(see ‘Count down’). Risk assessments
have become mired in controversy and 
extended review cycles. Worse, the EPA 
prioritizes revisions to assessments of chem- 
icals it has already evaluated, such as dioxin 
and mercury, rather than evaluating crucial
chemicals for the first time. 

The slow pace of IRIS threatens public 
health. Many people might assume that 
chemicals lacking an IRIS risk estimate are 
safer than those that have been assigned 
one, even if they are not. For example, the 
EPA’s assessment of perchloroethylene, used
in dry cleaning, has encouraged phasing 
out of the chemical. Some dry cleaners are 
switching to n-propyl bromide — for which 
there is no IRIS entry — despite evidence 
that it may pose a greater health risk than 
perchloroethylene.

Other difficulties arise from EPA efforts to 
characterize risk at ever-lower exposure lev- 
els, at which health effects are hard to observe. 
Reliant on animal experiments, the agency 
resorts to two critical assumptions: that any 
adverse health effects seen in rodents are 
mirrored in humans, and that the high doses 
used in the lab (to see an effect using a reason- 
able number of animals) can be extrapolated 
downwards, often by orders of magnitude, to 
reflect human population exposures. As the 
NAS has pointed out, the EPA often fails to 
justify the data used or explain how risks were 
estimated at low levels.
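
To see why the choice of extrapolation matters, consider a minimal sketch that compares two common simplifications: a linear no-threshold model, in which risk scales in direct proportion to dose, and a threshold model, in which no added risk is assumed below a cut-off dose. Neither function is the EPA's actual procedure, and every number below is hypothetical, chosen only to illustrate how far the two assumptions can diverge at low exposures.

```python
def linear_no_threshold(dose, test_dose, test_risk):
    """Extrapolate risk in direct proportion to dose, down to zero."""
    return test_risk * dose / test_dose


def threshold_model(dose, test_dose, test_risk, threshold):
    """Assume no added risk at or below the threshold, linear above it."""
    if dose <= threshold:
        return 0.0
    return test_risk * (dose - threshold) / (test_dose - threshold)


# Hypothetical inputs (arbitrary dose units), for illustration only.
TEST_DOSE = 100.0    # high dose given to lab animals
TEST_RISK = 0.2      # excess risk observed at that dose
HUMAN_DOSE = 0.01    # exposure four orders of magnitude lower
THRESHOLD = 1.0      # assumed dose below which no effect occurs

lnt_risk = linear_no_threshold(HUMAN_DOSE, TEST_DOSE, TEST_RISK)
thr_risk = threshold_model(HUMAN_DOSE, TEST_DOSE, TEST_RISK, THRESHOLD)

print(f"Linear no-threshold estimate: {lnt_risk:.1e}")
print(f"Threshold-model estimate:     {thr_risk:.1e}")
```

With these made-up inputs, the same animal data yield a non-zero risk under one assumption and zero added risk under the other, which is why the justification for the chosen model carries so much weight.
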

In our view, the problem is the EPA’s use 
of assumptions that it claims are “public 
health protective”, which err on the side of 
overstating risk when data are lacking. Take 
dioxin, for example. In its assessment, the 
EPA assumed the worst case — that low 
levels of dioxin cause cancer — because 
that possibility cannot be ruled out. Yet 
other agencies, including the World Health 
Organization, interpret the biological studies
of dioxin as suggesting that it is unlikely
to cause cancer at low levels because of the 
way the chemical behaves within cells. 

Such inflated risk estimates can lead to 
overly stringent regulations and can scram- 
ble agency priorities because the degree of 
precaution differs across chemicals. For 
example, the EPA’s National-Scale Air 
Toxics Assessment from 2005 estimated a 
tenfold-higher cancer risk from outdoor 
air exposure to carbon tetrachloride (used 
in dry cleaning and as a solvent and refrigerant)
than from ethylene dibromide (a
termite fumigant and former additive in 
petrol). Yet by taking on board the biological 
evidence, other agencies around the world 
have concluded the opposite — that carbon 
tetrachloride poses little risk because, unlike 
ethylene dibromide, it has a threshold for its 
carcinogenic action. 

The EPA intended that its air-toxicity 
results would help to set priorities for 
improving data in emission inventories, to 
target risk-reduction activities more effectively
and to identify pollutants and industrial
sources of greatest concern. But its
aggressive use of precautionary assumptions, 
even when they are scientifically unwar- 
ranted, instead misleads decision-makers. 


THE WAY FORWARD 

To its credit, the EPA has committed to 
adopting the NAS recommendations, 
including streamlining presentation of its 
analyses, making its toxicity evaluations 
more uniform and incorporating multiple 
data sets. To become fit for purpose again,
the agency must change its view of risk 
assessment. It should not see assessments as 
a search for scientific truth, but as a way to 
bring available information to bear on regu- 
latory and public-health decisions. 

The EPA should expand IRIS to include 
sources of information that are not cur- 
rently used, similar to the International 
Toxicity Estimates for Risk Assessment 
database (www.tera.org/iter). IRIS should 
report risk values developed by inter- 
national public-health agencies, by other 
health agencies in the United States and by 
private groups. 

The agency should integrate into IRIS 
information from its internal programmes, 
such as its Provisional Peer-Reviewed Tox- 
icity Value database, which contains more 
than 300 rapid-risk estimates developed 
to inform clean-up decisions at hazard- 
ous-waste sites. These estimates draw on 
information of varying quality, such as 
short-term toxicity tests, expert judgements 
and statistical models that predict a chemi- 
cal’s behaviour on the basis of its structure. 
The associated uncertainties should be 
reflected in the IRIS entry. 

In the longer term, the EPA should 
expedite its ongoing exploration of high- 
throughput screening methods. These can 
quickly ascertain a broad range of properties 
for a chemical, such as how readily it reacts 
with biological systems, and hence evaluate 
potential health risks. Once these methods
and an understanding of how they feed into 
risk estimates are established, the information 
should be incorporated into IRIS. 

Fundamentally, the EPA should replace 
risk values that are built on science-policy 
assumptions with risk estimates that acknowl- 
edge underlying uncertainties. For instance, 
the agency could follow the example of the 
Intergovernmental Panel on Climate Change
and report a range of risks that correspond to 
different models. Users would then be able to 
see whether a value is sufficiently precise to 
support a particular course of action. 

Critics might argue that decision-makers 
will suffer ‘paralysis by analysis’ if confronted 
with a range of values rather than just one. 
Yet that is how it should be. The EPA’s definitive
values are illusions: they conceal uncertainty
that cannot be resolved scientifically.
Bringing conflicting value judgements into 
the open will enable honest debate and 
improve public health.
Lessons for big-data projects


To be successful, consortia need clear management, codes of conduct 
and participants who are committed to working for the common good, 
says ENCODE lead analysis coordinator Ewan Birney. 


The ENCODE consortium has for the
past five years been building up an
encyclopaedia of functional DNA 
elements to be used as a reference for the 
scientific community. Today it publishes 
30 publicly accessible papers in three jour- 
nals — and all are connected to the processed 
analysis and raw data. This scientific under- 
taking has inspired new publishing models, 
such as the interweaving of topic threads 
between papers in different journals, and 
will, I hope, have a large impact on biology. 
The ENCODE project has delivered an 
incredible amount of information because 
of its sheer scale: more than 1,600 experi- 
ments on 147 cell types, including 235 anti- 
bodies or other assay protocols. The main 
paper has nearly 450 authors, working from 
more than 30 institutions. 


Because of its complexity (see page 46), 
the project could not have worked in the 
same way as one involving just one or two 
laboratories. Typically, scientists try to do 
the best science they can, with a limited set 
of collaborators, to earn grants and publi- 
cations to do what is best for science, their 
own careers and their own laboratories. 

This mindset doesn't work in consortium 
science. Instead, researchers must focus 
on creating the best data set they can. 
Maybe they will use the data, maybe they 
won't. What is important is the commu- 
nity resource, not individual success. This 
requires a shift in perspective to a common
goal of data output rather than publications. 
In turn, the success of consortium partici- 
pants must be measured at least as much by 
how their data have enabled science as by the 
insights they have produced. 


SUPPORTING THE COMMUNITY 

Big-biology consortia such as ENCODE, 
HapMap and the 1000 Genomes Project 
approach grand-scale work systematically. 
For example, they often take a ‘catalogue’ 
approach to create foundational resources 
rather than spotlighting areas of interest, and 
they use standardized methods, reagents and 
analysis schemes. The cost of these projects 
is justified by the breadth of science they sup- 
port — from genome-wide analysis down to 
smaller-scale, hypothesis-driven studies.

Has the big project had its day in the
current era of ‘democratized’ data gathering? 
Certainly the drop in the price of data gath- 
ering has changed the game for all biology 
groups — and nearly always for the better 
(although there are of course new challenges 
in how to handle this). But the cheapness of 
data just extends the reach of large-scale 
projects; it does not alter the need to cre- 
ate systematic reference data sets. It is hard, 
if not impossible, to combine smaller data 
sets into reference data sets — as demon- 
strated by the initial chromosome maps in 
the Human Genome Project or the attempt 
to patch together collections of microarray 
data into an atlas of gene expression. 

Instead, a systematic data ‘skeleton’ is
needed (for genomes, functional elements 
and variation, for example), around 
which smaller-scale experiments can add 
insight, colour and deeper understand- 
ing. ENCODE, BLUEPRINT and the 1000 
Genomes Project are examples of such skel- 
etons. The main products of ENCODE and 
similar projects are not just raw data, but also 
analysed intermediates that allow scientists 
to choose the level of detail at which they 
wish to start. 

I have been involved in consortia at 
various levels since 1999. In 2004, I became 
the coordinator of the ENCODE analysis. I 
have learned that consortia are difficult to 
make successful, because they involve people 
who might be competing with one another 
in another context. Getting competitors to 
work openly together towards a shared goal 
is not trivial. It relies on the good will of all. 

ENCODE has made it clear to me that 
effective consortium science requires all 
participants to buy into a structure, a code 
of conduct and the goal of high-quality data 
that are made accessible and usable to all sci- 
entists around the world. 


CLEAR STRUCTURE 

In my opinion, for large consortia to succeed, 
they need to create a structure that is trans- 
parent to everyone involved. 

This structure cannot follow the classic 
model of a single institute with a fixed 
hierarchy, or even a single ‘virtual’ institute 
agreed on by multiple partners. Instead, 
as happened for ENCODE, an open, peer- 
reviewed process should select and evalu- 
ate the partners who are best suited to a 
self-organized structure. And the structure 
should be flexible enough to change over 
time and to encompass multiple sources 
of funding. Considering each partner as 
an individual — rather than regarding the 
consortium as a single group — allows the 
addition of innovative participants from out- 
side the expected group. ENCODE probably 
would not have such a great depth of input 
from statistical groups had the project been 
funded by a single large grant. 


A diverse collection of scientists keeps 
the ideas fresh and the technology agile. It 
prevents ‘group think’. For example, when
there is a shift in technology, labs differ in 
their uptake. It would be damaging if every- 
one either committed too early to a poorly 
performing technology, or delayed uptake 
of a successful one. Broad participation also
connects the output to a much larger audi- 
ence worldwide. 

Large consortia do, however, need to avoid
a common pitfall: sharing the responsibility
between too many principal investigators
and senior postdoctoral fellows. This renders
decision-making difficult. Without a
core structure, there is a risk that members
will shift their focus to their own interest
areas at the expense of the overall project.
At the same time, these projects are too
big and complex to be managed by one
person, who is unlikely
to have expertise in all the relevant areas.
Initiatives that are piloted by one or a few 
principal investigators are more common 
in consortia working on diseases, and in my 
experience they often lack an operational 
project manager with a well-defined role. 

The ENCODE consortium had an 
internal structure that I believe was instru- 
mental to its success. It had a ‘spine’ of 
leadership comprising: scientifically aware 
project officers in the primary funding 
agency, the National Human Genome 
Research Institute at the US National Insti- 
tutes of Health; a few leading scientists with 
goals aligned to the consortium; and one or 
two scientific project managers hired inside 
the consortium who had a detailed under- 
standing of all the tasks and people involved. 
ENCODE’s two key project coordinators
(Ian Dunham and Anshul Kundaje)
were funded for the lifetime of the project 
through a grant for which I was the princi- 
pal investigator. Successful consortia tend to 
have similar core structures, suggesting that 
this is a natural and effective way to organize 
such projects. 

The spine was able to resolve some of 
the most complex problems — both sci- 
entific and social — such as sorting out a 
quality-control disagreement between a 
data-production and data-analysis group. 
As in any endeavour that involves many 
individuals, communication channels are 
crucial for success. We should have explic- 
itly broadcast the existence of this spine 
both to the group and externally, to pro- 
vide more transparency with respect to how 
decisions were made. 

I also think that funding agencies 
should become more involved in shaping
consortia. They should be flexible enough
to shift their support from one group to 
another as needed, with adequate warning, 
and to withdraw funding from poorly per- 
forming or uncooperative partners — again 
with warning and with real consequences. 
Funding agreements often include such 
terms and conditions, but they are rarely 
used, perhaps because the threat of action 
is enough. And perhaps funding agencies 
feel uncomfortable, understandably, tak- 
ing on such a scientifically directive role. 
But the responsibility for the overall success 
of the project rests firmly with the funding 
agency, so it must feel empowered to inter- 
vene when necessary. 


CODES OF CONDUCT 

Consortium science involves interaction 
between humans, with all the social com- 
plications this entails. It happens across 
multiple sites and time zones, and the part- 
ners generally communicate electronically, 
rather than in person. Misunderstandings 
and clashes can arise because of cultural dif- 
ferences — at national, organizational and 
individual levels. 

To ensure that things run smoothly, rules 
are essential. An agreed-upon, written 
and publicly accessible code of conduct is 
extremely beneficial to large consortia, par- 
ticularly when they need to incorporate less- 
experienced partners. ENCODE had several 
written rules, on issues such as data release, 
and these were circulated internally. 

Such rules help to ensure that partners 
work within the goals of the consortium and 
do not (consciously or unconsciously) form 
a cartel that controls access to the data and 
analysis. An advisory board should regularly 
scrutinize internal and external partners for 
scientific impact, capacity to deliver and 
ability to interact effectively. Although I am 
confident that ENCODE did not restrict 
access to data or analysis through the rules 
of the funding agency, outside groups occa- 
sionally had that impression, and that is a 
failing I deeply regret. 

We should also have had written guide- 
lines on how to transfer work between 
groups, how to assign credit when papers are 
published and how and when project offic- 
ers should communicate, especially during 
times of conflict. Implicit rules of behaviour 
in consortia are often exploited by more 
experienced participants. 

Large consortia clearly benefit from an 
open-door policy that allows new, unfunded 
analysts to participate. And when these indi- 
viduals join the group or work with released 
consortium data, their analyses should be 
considered equally creditable and stigma- 
free relative to those performed by long- 
standing group members. 

That brings us to error-catching. Big 
projects generate errors and have a range 
of artefacts, so most researchers agree
that data should be released to the larger 
community sooner rather than later. In 
ENCODE, we came to understand how 
time-consuming and involved quality con- 
trol at scale is. It was not until around half- 
way through the process that we were able 
to assess the experiments retrospectively 
with a formalized, centralized, quality- 
control system. Most experiments were 
exemplary; some had to be redone. A few 
had to be left out. 

The quality-control metrics and our 
final ‘call’ on whether a data set would be in 
or out are publicly accessible on the project
website. Although important and biologically
correct, some experiments scored low
on quality-control metrics because they
had, for example, very few true sites where 
a protein bound to DNA. Other sources 
of error, such as that from a cross-reactive 
antibody, generated excellent scores — 
the antibody ‘worked’ because it bound to 
a particular class of molecule, but it also 
bound to many others that were not pre- 
dicted by the analysis. I wish now that we 
had accelerated the centralized quality- 
control process earlier, and been more 
open about this process. 

Although most errors are caught within 
a consortium before they are released, new 
analysis of public data inevitably uncovers 
more, particularly early in data production. 
When analysing such early data, external
groups should report such errors promptly
and without rancour. Although funders
need to measure data quality in a standardized
way, during early data production
consortia should really be judged not by
absolute error rates, but by how quickly
they can rectify reported errors.

© 2012 Macmillan Publishers Limited. All rights reserved

Funders have considerable influence in 
how raw and analysed data are released, 
and should design policies that maximize 
reuse. Early data-release policies focused 
on how data should be shared before publi- 
cation, with clumsy etiquette-based restric- 
tions on the first publications of global 
analysis, such as waiting for the authors 
who generated the data to publish their 
analyses before others can publish on the 
entire data set. These agreements are start- 
ing to show their age and a lack of clarity. 

The new era of analysis calls for a 
rethink, with more focus on the release 
of intermediate analysis throughout the 
project, so that the community can use the 
resource more fully during the project; the 
1000 Genomes consortium has done well 
in this regard. 


DOES IT DELIVER? 

The overall importance of consortium science
cannot be assessed until years after the data
are assembled. But reference data sets are 
repeatedly used by numerous scientists 
worldwide, often long after the consor- 
tium disbands. We already know of more 
than 100 non-consortium publications that 
make use of ENCODE data, and I expect 
many more in the forthcoming years. 

Even if massive projects are successful, I 
feel strongly that the vast majority of fund- 
ing should still go to smaller, more creative, 
hypothesis-led science. 

For consortium participants, my call 
for more scrutiny, more clarity and more 
independent utilization of the data might 
seem restrictive, but I am confident that 
it will only benefit science and scientists 
in the long run. Even if large consortia 
receive only a small proportion of a dis- 
cipline’s funding, that can be a substantial 
amount when concentrated on a limited set 
of groups. If this is to continue, the entire 
community must be able to understand 
and use the resultant data. 

ENCODE is a foundational data set for 
understanding the human genome. I am 
proud of what we have delivered, but there 
are things we could have done better. I 
hope that other groups can learn from our 
experience.


Ewan Birney is lead ENCODE analysis 
coordinator and associate director of the 
European Molecular Biology Laboratory's 
European Bioinformatics Institute in 
Hinxton, UK. 

e-mail: birney@ebi.ac.uk 
Life in stone


Georgina Ferry enjoys a biography of a little-known 
Victorian woman who built monuments to nature. 


In the village of Wreay, just south of the
border between England and Scotland,
stands a wholly original building: a synthesis
of decorative motifs drawn from early
nineteenth-century geology and natural history
with an ancient architectural style. This
small church, completed in 1842, is the work
of a remarkable Victorian, Sarah Losh.

As Jenny Uglow reveals in her intriguing
biography, The Pinecone, Losh was by the age
of 18 a competent mathematician, linguist
and classicist, and knowledgeable about
science, architecture, politics, philosophy,
literature and art. Her nineteenth-century
biographer, Henry Lonsdale, wrote: “With
powers to grapple with Euclid and algebra, 
she had but to give her attention to any sub- 
ject to master it.” She also had a clear sense 
of her own self-worth. Unlike writers of 
her time such as Jane Austen, Mary Shelley 
and Elizabeth Gaskell, she has not achieved 
worldwide recognition. Yet after her death, 
Dante Gabriel Rossetti hailed her as a genius, 
and her work foreshadowed the designs 
of John Ruskin, Alfred Waterhouse and 
William Morris. 

Books move, but buildings stay in one
place. Losh, by building almost exclusively
in Wreay, ensured that beyond her immediate
locality only specialists would come to know
and admire her work. Panning outwards
from this small, largely agricultural community,
Uglow uses Losh’s story to create
a vibrant panorama of early nineteenth-century
society that extends throughout
the British Isles, across Europe and even to
the deadly passes of Afghanistan. Uglow
is at ease in the intellectual environment
of the era, which she
researched fully for her book The Lunar Men
(Faber, 2002).

The Pinecone: The Story of Sarah Losh, Forgotten Romantic Heroine — Antiquarian, Architect and Visionary
JENNY UGLOW
Faber/Farrar, Straus and Giroux: 2012/2013. 344 pp./352 pp. £20/$28

Losh’s family of country landowners pro- 
vided wealth, stability and an education 
infused with principles of the Enlighten- 
ment. Her father, John, and several uncles 
were experimenters, industrialists, religious 
nonconformists, political reformers and 
enthusiastic supporters of scientific, literary, 
historical and artistic endeavour, like mem- 
bers of the Lunar Society in Birmingham, 
UK. John Losh was a knowledgeable collector 
of Cumbrian fossils and minerals. His family, 
meanwhile, eagerly consumed the works of 
geologists James Hutton, Charles Lyell and 
William Buckland, which revealed ancient 
worlds teeming with strange life forms. 

Sarah’s uncle James Losh — a friend of 
political philosopher William Godwin, 
husband of the pioneering feminist Mary 
Wollstonecraft — took the education of his 
clever niece seriously. She read all the latest 
books, and met some of the foremost inno- 
vators of the day, such as the mathematician 
Isaac Milner and the physicist John Leslie. 

On their father’s death in 1814, Sarah and 
her beloved sister Katharine inherited sub- 
stantial property in Wreay and interests in 
their father’s successful alkali factory in the 
expanding industrial city of Newcastle. Their 
financial independence secure, neither ever 
married. Instead, they toured France, Ger- 
many and Italy together. In Italy, Losh saw 
for herself the simplicity of classical Roman 
and early medieval architecture. Once home, 
the sisters built a school and a house for the 
local schoolmaster based on simple, pre- 
Renaissance forms — the house was a copy 
of a Pompeiian cottage. After Katharine 
died, Losh embarked on her masterpiece. 

Brooking no argument from the Bishop 
of Carlisle, she offered to fund the com- 
plete rebuilding of Saint Mary’s, her village 
church, on the condition that she “be left 
unrestricted as to the mode of building it”.
She ignored the contemporary craze
for the Gothic, opting instead for a style 
modelled on the Romanesque: a simple 
rectangular building with a semicircular 
apse, and doors and windows topped 
with round arches. 

She made the building entirely her 
own by adding decorative carvings that 
combined rich pre-Christian symbolism 
with natural forms recently brought to 
light by fossil-hunters and naturalists. 
Executed by local craftsmen (and some- 
times Losh herself) working mostly in 
local stone and wood, these anticipated 
the artistic and architectural ideals set 
out by John Ruskin a decade after the 
church was completed. Lotus flowers, 
ammonites and butterflies embellished 
windows, doorways and capitals; Losh 
filled the high windows of the apse with 
the delicate forms of local fossil ferns 
cut from translucent sheets of alabaster. 
More than 30 years after she completed 
her church, and on a much grander scale, 
Alfred Waterhouse adopted a Roman- 
esque design decorated with flora and 
fauna for the Natural History Museum 
in London. Like Losh, he was inspired by 
visiting Italy and studying natural history, 
but Uglow cites no evidence that he knew 
of Losh’s work. 

Losh’s carvings often feature a pine- 
cone, an ancient symbol of regenera- 
tion and enlightenment. Uglow points 
out that the number of spirals winding 
up from the base of a pinecone always 
belongs to the Fibonacci series (run- 
ning 1, 2, 3, 5, 8 and so on, without end). 
James Hutton memorably concluded that 
he could find “no vestige of a beginning, 
no prospect of an end” in his studies of 
geological strata. Uglow helps us to see 
how Losh combined the architectural 
evidence of past human societies with 
contemporary invention and discov- 
ery, and how she conveyed, through her 
buildings, a sense of the eternal. 

Most of Losh’s personal papers and 
journals, like those of Jane Austen, were 
lost or destroyed, leaving the biographer 
to piece together her life from fragments 
gleaned elsewhere. Sarah Losh remains 
something of an enigma: a deeply reli- 
gious woman who built a church that 
contained no overtly Christian symbols; 
a devotee of ancient structures and a 
daughter of the Industrial Revolution; 
a fashionable beauty and an unmarried 
scholar and craftswoman. 

Sarah Losh chose to express herself in 
stone, rather than words. In Jenny Uglow, 
she has found a fine interpreter.


Georgina Ferry is a science writer and 
author living in Oxford, UK. 
e-mail: mgf@georginaferry.com 
Canadian pianist Glenn Gould recorded Bach’s Goldberg Variations twice, in 1955 and 1981.


TECHNOLOGY 


Baroque geekery 


Tim Boon assesses a take on the evolving technology 
behind recordings of J. S. Bach. 


Paul Elie reveres the music of J. S. Bach
and loves some recordings in particular,
such as Glenn Gould’s 1955
rendition of the Goldberg Variations. In 
Reinventing Bach Elie sets out to show how 
technologies — especially developments in 
recording — have been central to the twen- 
tieth century’s experience of “the Master’s” 
music. 

The book's conceit is that the composer 
of the Two- and Three-Part Inventions was 
in some sense an inventor, and so peculiarly 
attuned to being reinvented — through the 
recording technologies of the past 100 years 
or so. And, as Elie shows, the power that 
recording offered, of enabling repeated lis- 
tening, also accelerated the rediscovery of 
Bach by generations of musicians. 

Each chapter takes a key recording, dwell- 
ing to different degrees on the technology 
used — disc, tape or digital. The chapters 
are arranged in roughly chronological order 
and range from takes by Albert Schweitzer 
and Leopold Stokowski on the famous 
Toccata and Fugue in D Minor to Gould’s 
two recordings of the Goldberg Variations
and beyond. Alongside this, Elie threads
a biography of Bach, period-setting snapshots
of cultural events and an accumulating
cast of Bach performers and
recording artistes.

Reinventing Bach
PAUL ELIE
Farrar, Straus and Giroux: 2012. 496 pp. £19.99, $30

Throughout, Elie describes the music,
not with the technical terminology of the
conservatoire, but with metaphor and simile.
His characterization of the Toccata and
Fugue in D Minor, for instance, reads: “the
pipes ring out once, twice, a third time. Then
with a long, low swallow the organ fills with
sound, which spreads toward the ends of the
instrument and settles, pooling there.” What
he doesn't do, however, is meet the promise
in the publisher's blurb to give us “a nuanced
and intelligent examination of the technology”
that has made the reinvention of Bach
possible.

Elie draws on a wide range of published
literature, and is insightful about the interplay between
technological change and the development 
of both individual technique and the market 
for classical music. For example, he describes 
how Gould’s recordings of the Goldberg Vari- 
ations were polished as the pianist, holed up 
at a country retreat, repeatedly recorded and 
listened back to his own performances of the 
30 variations on the recently invented tape 
recorder. Elie also nicely depicts how the his- 
torically informed performance scene was 
stimulated by the arrival of the CD: the clar- 
ity of digital recording gave period-music 
specialists an opportunity to provide newly 
‘authentic’ performances.

But the descriptions of technologies are 
less sure. Magnetic recording tape does 
not use silver oxides, as the book has it, but 
iron oxides. Elie also writes that Schweitzer 
recorded on cylinders, yet EMI always used 
discs. His description of a 1905 Victrola 
gramophone as having a needle convert- 
ing movements to electrical impulses reads 
oddly. This is an entirely acoustic device in 
which even the motor is clockwork; there 
were no electrical gramophones before the 
1920s. 

The book would also be stronger for a 
deeper and more integrated account of 
musical instruments. The hybrid instru- 
ment given to Schweitzer by the Paris Bach 
Society when he went as a missionary to 
Africa — enabling him to play in tropical 
conditions — is described merely as hav- 
ing “the features of a piano and an organ: 
two manuals, strings and hammers, ped- 
als. The inside of it was lined with zinc to 
ward off moisture in the tropics”. (This 
amazing-sounding machine can be seen
in the Maison Albert Schweitzer, the organist’s
former home, in Alsace, France.) Similarly,
Bach's possible involvement in the
development of a new instrument called the
Lautenwerck, a kind of keyboard-actuated
lute, is glossed over in two brief paragraphs 
— a loss, given the emphasis on Bach as 
inventor. 

In the end, Reinventing Bach reads best 
as a sincere and compelling account of the 
author’s love of Bach’s recorded oeuvre. The 
passion shines through even though the 
technology is more marginal than prom- 
ised. And you may find yourself compelled 
to rummage through your CD shelves for the 
works — as I did — revisiting Bach in his 
multifarious reinventions.


Tim Boon is head of research and public 
history at the Science Museum in London, 
UK. 

e-mail: tim.boon@sciencemuseum.ac.uk 


Books in brief 


The Science of Human Perfection: How Genes Became the Heart 
of American Medicine 

Nathaniel Comfort YALE UNIVERSITY PRESS 336 pp. £25 (2012) 

In this provocative look at genetic medicine in the United States, 
medical historian Nathaniel Comfort argues that eugenics casts a 
long shadow over the field. He has researched records spanning a 
century, following the ever-evolving group of geneticists, eugenicists, 
psychologists, medics, public-health workers, zoologists and 
statisticians intent on using heredity to improve human life. Today’s 
hybridized discipline, he says, is noble in intent but rife with social 
and ethical questions centred on the ‘illusion of perfectibility’. 


Discord: The Story of Noise 

Mike Goldsmith OXFORD UNIVERSITY PRESS 336 pp. £16.99 (2012) 

You might pay to hear a jazz saxophonist let rip in a club, but go crazy 
if they practised next door. Sound in the wrong place is noise, points 
out science writer and former head of acoustics at the UK National 
Physical Laboratory Mike Goldsmith in this chronicle of cacophony 
and our attempts to control it. Starting with the nature of sound and 
its birth in the infant Universe, he runs through prehistoric noise, the 
beginnings of acoustical science in the Renaissance, the machine-led 
din of the Industrial Revolution, the clamorous twentieth century and 
today’s aural pollution from wind farms, underwater sonar and more. 


Unaccountable: What Hospitals Won’t Tell You and How 
Transparency Can Revolutionize Health Care 

Martin Makary BLOOMSBURY 256 pp. £19.99 (2012) 

Surgeon and health-policy specialist Martin Makary reveals US 
hospitals as battlegrounds between competence and chaos. Serious 
blunders — such as surgical tools being left in body cavities — are 
so common that a 2010 study reported that one-quarter of patients 
are harmed by medical mistakes. Among Makary’s mind-bending 
observations is how two doctors approached the removal of benign 
colonic polyps. One neatly excised the growth; the other removed 
half the colon. A powerful plea for openness in US health care. 


Why Geography Matters, More Than Ever 
Harm de Blij OXFORD UNIVERSITY PRESS 320 pp. £10.99 (2012) 
Where geopolitics is concerned, Harm de Blij says, it’s easy to hit a
plus ça change moment. This revised edition of his influential 2007
book includes the rapid shifts and upheavals of the past five years, 
from the Arab Spring to the European Union’s economic wobbles. 
But de Blij’s original premise — that the geographical illiteracy 
prevalent in the United States seriously impedes coherent policy 
— is more relevant than ever. With power comes responsibility, and
Americans, he says, have an obligation to develop the geographer’s
perspective on culture, politics, economics and the environment. 


On the Edge: Mapping North America’s Coasts

Roger M. McCoy OXFORD UNIVERSITY PRESS 256 pp. £18.99 (2012)
Some 500 years ago, the edges of North America were as
mysterious to Europe’s explorers as the Moon. Geographer Roger
McCoy recounts their voyages and cartographic efforts, starting with
John Cabot and Martin Frobisher, and ending with Otto Sverdrup
and Vilhjalmur Stefansson in the early twentieth century. The tales of
derring-do, brushes with death and brutal behaviour towards native
Americans are interspersed with clear explanations of how, over
time, this multitude of mariners redrew the New World map.
A rendering of the upper level of the Museum of Mathematics, which opens in New York later this year.


Q&A Glen Whitney 
Maths demystifier 


Mathematician Glen Whitney left a job in finance to set up the Museum of Mathematics 
(MoMath), which is due to open in Manhattan, New York, on 15 December. He wants to spread 
the word that mathematics is a beautiful discipline and all around us, from the geometry of soap
bubbles to the algorithms that control traffic lights.


How did you start out in mathematics?

When I was young, I broke my collarbone
playing soccer and fell in love with
maths problems while recuperating. I had
a voracious appetite
for mathematics in high school but when I
went on to Harvard had no illusion that I was 
going to be one of the top researchers in the 
country. After teaching at the University of 
Michigan, I received an offer to try statistical 
trading at Renaissance Technologies in New 
York, a hedge fund run by mathematician Jim 
Simons. I decided to give it a try. I started out 
in the data group, then migrated into research- 
ing trade strategies and on to improving the 
research tools themselves. It was exciting and 
intellectually demanding, but I wanted to do 
something beneficial to society at large. 


Why did you focus on the public image of 
mathematics? 

The National Security Agency views the 
shortage of US mathematicians as one of
the country’s biggest security threats. Yet
you often hear people say, “I was always ter- 
rible at maths”. No one says that about read- 
ing. I believe this attitude stems primarily 
from the emphasis on rote procedures and 
people paying too little attention to making 
connections with everyday life and the world 
around them. We need a cultural institution 
to combat this prejudice. 


And why a museum? 

Many science museums are sparse on maths 
content. A lack of contemporary mathemat- 
ics exhibits means that one from 1960 is still 
housed at the New York Hall of Science and 
the Museum of Science in Boston, Mas- 
sachusetts. When kids see chemistry and 
physics exhibits but none on mathematics, it 
conveys a subtle but powerful message. The 
United States used to have one museum of 
mathematics — on Long Island, New York 
— but it was so small you had to gather ten 
people for it to open. It closed down in 2006 
and I realized that was an opportunity to 
create an environment for people to have 
seminal experiences with mathematical 
concepts, to show that maths is as much a 
part of our society as the other sciences.


What will a visitor find at your museum? 
Hands-on exhibits showing how mathemat- 
ics can be tangible, open-ended and fun. In 
the new museum, we will have exhibits on 
everything from the beautiful patterns cre- 
ated by video feedback to the probabilities of 
making a free throw in basketball. 


Have you debuted any of your exhibits? 

Yes, in a travelling exhibition, the Math Mid- 
way, which has appeared at science museums 
and festivals across the country, and will con- 
tinue to tour beyond the museum opening. 
Its iconic exhibit is a square-wheeled tricycle 
that rides smoothly over a surface of inverted 
catenary curves calculated to keep the axles 
of the tricycle level as the corners of the wheel 
rotate, which seems to give people the sense 
that maths can make the impossible possi- 
ble. Another exhibit is a plane of laser light 
that shows all of the possible cross-sections 
of translucent three-dimensional solids. Visi- 
tors can rotate a cube to learn that it can be 
sliced to yield not just squares and triangles, 
but trapezoids, rhombuses and even a regular 
hexagon that cuts through all six faces. 
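
The square-wheel claim above can be checked numerically. The sketch below is illustrative only and is not taken from the museum's own design: it assumes an idealized square wheel of half-side a rolling without slipping, with the axle directly above the contact point, on a road whose bumps follow the inverted catenary y = -a cosh(x/a). The function names are hypothetical.

```python
import math

def road_height(x, a):
    """Height of one inverted-catenary bump, centred at x = 0."""
    return -a * math.cosh(x / a)

def road_slope(x, a):
    """Derivative of road_height with respect to x."""
    return -math.sinh(x / a)

def axle_height(x, a):
    """Axle height when the flat side of the wheel touches the road at x.

    The side is tangent to the road, so it is tilted by theta = atan(slope).
    The wheel centre sits a perpendicular distance a from that side, which is
    a / cos(theta) measured vertically above the contact point.
    """
    theta = math.atan(road_slope(x, a))
    return road_height(x, a) + a / math.cos(theta)

def bump_half_width(a):
    """Half-width of one bump: the side has tilted 45 degrees at its ends."""
    return a * math.asinh(1.0)

def bump_arc_length(a):
    """Arc length of one bump, from the closed form for a catenary."""
    return 2 * a * math.sinh(bump_half_width(a) / a)

if __name__ == "__main__":
    a = 0.3  # assumed half-side of the wheel, in metres
    w = bump_half_width(a)
    for i in range(5):
        x = -w + 2 * w * i / 4
        print(f"x = {x:+.3f}  axle height = {axle_height(x, a):+.2e}")
    print("bump arc length:", bump_arc_length(a), "wheel side:", 2 * a)
```

Under these assumptions the axle height stays at zero across the whole bump, and each bump's arc length matches one side of the wheel, which is why the corners drop neatly into the gaps between bumps.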


What else are you doing until the museum 
opens? 

MoMath holds a monthly lecture series 
called Math Encounters in which we strive 
to show unexpected ways that maths touches 
everyday life — such as in the geometry of 
soap bubbles. Upcoming presentations 
include a talk about the maths of sport, and 
one on the maths of origami. 


What sets MoMath apart from other 
mathematics outreach and education efforts? 
Besides the fact that MoMath will be the 
only museum in North America devoted 
specifically to mathematics, there are a 
few distinctive aspects to its approach: a 
focus on physical interaction, especially 
whole-body involvement; an effort to show 
as broad a spectrum of the world of math- 
ematics as possible, not tied to any specific 
curriculum; and an emphasis on giving peo- 
ple the experience of the “Aha!” moment of 
discovery. 


You also run mathematical walking tours in 
Manhattan. What do those involve? 

I talk about the algorithms used to con- 
trol traffic lights, the mathematical issues 
involved in keeping the subway running, 
the symmetry of the mouldings on the sides 
of buildings and the unusual geometry that 
gives gingko trees their distinctive shape. 
There are deep connections to music, art 
and finance. If you give me a route, I'll make
a tour. There is maths everywhere. 


INTERVIEW BY JASCHA HOFFMAN 


MUSEUM OF MATHEMATICS 


N. HIGGINS 


Correspondence 


NASA bids are not a 
popularity contest 


You recently conducted an online 
popularity poll of three proposals 
competing for selection as the 
next NASA Discovery Program 
mission (Nature http://doi.org/h79;
2012). In my view, the
concept and execution of this
poll demean Nature and belittle
what is at stake.

Worse, there are indications 
that the poll could have been 
manipulated. Voting for one 
particular mission occurred in 
a large burst on a single day. It
is immaterial whether this was 
caused by the mission teams 
enlisting many supporters to 
vote quickly, or by people who 
worked out an easy way to vote 
multiple times. The point is that 
the results are not meaningful. 

Popularity contests are not the 
way to choose among scientific 
alternatives. Although public 
interest needs to be taken 
into account when spending 
taxpayers’ money, selecting 
a mission should ultimately 
depend on its scientific merit and 
technical feasibility. 

NASA’s missions have a
track record of exciting the 
public anyway, with web hits 
for different missions leading 
to server saturation during key 
events. The likely effectiveness 
of each mission’s outreach 
programme needs to be 
evaluated by looking carefully 
at the large, detailed proposals 
submitted by each mission team. 
Michael F. A’Hearn University
Park, Maryland, USA. 
mahearn@mac.com 
Competing interests declared; see 
go.nature.com/Iaqferq. 


Tourism ban won’t 
help Indian tigers 


The Indian Supreme Court’s 
temporary injunction against 
tourism in core areas of tiger 
reserves could place the animals 
at greater risk of poaching if it 
becomes permanent, by reducing 
revenue for park management
(Nature 488, 10; 2012). The
injunction has now been 
extended until 27 September. 

Most of the reserves with the 
highest numbers of tigers and 
tourists are in the state of Madhya 
Pradesh. In 2010-11, the state’s 
35 parks received US$17.1 million 
from government sources. Five 
tiger reserves generated most 
of the $2.8 million obtained 
from tourism. In 2011-12, 
Bandhavgarh reserve received 
$1.2 million in tourist revenue 
and almost the same amount 
from government sources. 
Tourism therefore yields 25-50% 
of tiger conservation funds in 
Madhya Pradesh, safeguarding up 
to 130 tigers. 

Different management 
strategies would be more effective 
in overcoming conservation 
concerns stemming from 
disruptive tourist behaviour. 

Ralf C. Buckley Griffith 
University, Australia. 
r.buckley@griffith.edu.au 

H. S. Pabla Madhya Pradesh, 
India. 


Tighten up Japan’s 
stem-cell practices 


Japan has bioethical regulations 
and clinical guidelines in place 
for experimental stem-cell 
therapies and for stem-cell-based 
pharmaceuticals. As a forensic
pathologist who has worked
on a patient who died after
mesenchymal stem-cell therapy 
in Japan, I am aware that other 
patients receiving this treatment 
have developed serious and even 
fatal complications. These cases 
indicate that Japan's regulatory 
infrastructure needs to be more 
strongly enforced. 

Reaction in Japan to these 
cases has been minimal. This 
contrasts with the tough 
approach of the US Food and 
Drug Administration, which 
led to the prompt prosecution 
of clinicians and companies 
involved in similar cases in 
Colorado and Texas (see Nature 
477, 377-378; 2011). 

Japan’s Investigative
Commission for Institutional
Framework in Regenerative 
Medicine recommended 
establishing a punitive system 
for physicians and clinics 
practising unethical activities, 
but its 2011 report made no 
mention of such plans. The 
country’s specialist medical 
organizations should push for 
government collaboration if an 
effective disciplinary system is to 
be established (E. Dolgin Nature 
Med. 16, 495; 2010). 

The Japanese Medical Ethics 
Committee, for example, 
needs to work more like the 
UK General Medical Council, 
which does not depend on the 
country’s judiciary system to 
exercise its powers. 

The Japanese Society for 
Regenerative Medicine and the 
International Society for Stem 
Cell Research should collaborate 
with Japan’s health ministry to 
establish a system to prevent 
further stem-cell-related deaths. 
Hiroshi Ikegaya Kyoto 
Prefectural University of 
Medicine, Kyoto, Japan. 
ikegaya@koto.kpu-m.ac.jp 


Avoid constructing 
wind farms on peat 


Scotland’s government is 
planning to build large-scale 
wind farms to reduce carbon 
emissions from electricity 
production, some of which 
could be situated on peatlands. 
We contend that wind farms 
on peatlands will probably not 
reduce emissions, unlike those 
on mineral soils. 

Wind farms are often located 
in upland areas because most 
of these are windy, distant from 
residential areas and of low 
agricultural value. Peatlands are 
prevalent in UK uplands and are 
richer in carbon than mineral 
soils because peats are formed 
from decomposing wet vegetable 
matter. Peatlands therefore have 
a higher net carbon loss when 
drained for construction. 

The UK wind industry uses a 
method we and our colleagues
developed to estimate carbon
emissions (D. R. Nayak et al. 
Mires Peat 4, 9; 2010). On this 
basis, and assuming current 
emission factors for electricity 
generation, our previous work 
argued that most peatland sites 
could save on net emissions if 
peat is not drained and if sites are 
restored after construction. 
However, emissions factors are 
likely to drop significantly in the 
future owing to reduced fossil- 
fuel use in electricity generation 
(see go.nature.com/Inowou). As
a result, peatland sites would be
less likely to generate a reduction 
in carbon emissions, even with 
careful management. Unless the 
volume of peat excavated can 
be significantly reduced relative 
to energy output, we suggest 
that construction of wind farms 
on non-degraded peats should 
always be avoided. 
Jo Smith, Dali Rani Nayak, Pete 
Smith University of Aberdeen, UK. 
jo.smith@abdn.ac.uk 


Improve sanitation 
on India’s railways 


A good place to start with India’s 
problems of poor sanitation (see, 
for example, Nature 486, 185; 
2012) would be the country’s 
150-year-old railway network, 
which carries 30 million 
passengers every day. Hygienic 
sanitation technologies have yet 
to be installed in all passenger 
coaches. 

The basic lavatory design 
throws excreta on to the open 
railway tracks. This system 
risks spreading pathogens and 
parasites to distant locations. 

One solution would be to 
install small biogas plants on 
trains or at stations. These would 
generate revenue — from excreta 
— that could be used to employ 
cleaning and disposal squads. 
Abhishek Sharma,
M. K. Unnikrishnan Manipal
University, Karnataka, India. 
abhisheksharma0991@gmail.com 
Ankush Madaan McGill 
University, Montreal, Quebec, 
Canada. 
OBITUARY


Martin Fleischmann 


(1927-2012) 


Pioneering electrochemist who claimed to have discovered cold fusion. 


Although a final reckoning should not
allow genuine achievements to be overshadowed
by errors, the blot that
cold fusion left on Martin Fleischmann’s 
reputation is hard to expunge. 

Fleischmann, who died on 3 August at the 
age of 85 after illness related to Parkinson's 
disease, heart disease and diabetes, was the 
first to observe enhanced Raman emis- 
sion from molecules at surfaces, now the 
basis of a spectroscopy technique. And he 
developed ultramicroelectrodes, used as 
sensitive electrochemical probes. 

Nonetheless, he is best known for his 
claim in 1989 to have initiated nuclear 
fusion in bench-top apparatus. The 
‘cold fusion’ debacle provoked bitter 
disputes that reverberate today. Along 
with polywater and homeopathy, cold 
fusion is now regarded as one of the most 
notorious cases of what chemist Irving 
Langmuir called pathological science: 
“the science of things that aren't so”. 

Cold fusion was not really an aber- 
ration for Fleischmann, but an extreme 
example of his willingness to suggest bold 
and provocative ideas, to take risks and to 
make imaginative leaps that could some- 
times yield a rich harvest. 

Fleischmann was born in Carlsbad in 
Czechoslovakia in 1927. His father was of 
Jewish heritage, and, just before the German 
invasion, his family fled to the Netherlands 
and then to England. Fleischmann studied 
chemistry at Imperial College London and, 
after a PhD in electrochemistry, moved to 
Newcastle University, UK. In 1967 he was 
appointed as the Faraday Chair of Chemistry 
at the University of Southampton, UK. 

In 1974, Fleischmann and his co-workers 
observed unusually intense Raman emis- 
sion (scattered light shifted in energy by 
interactions with molecular vibrational 
states) from organic molecules adsorbed 
on the surface of silver electrodes. Although 
the enhancement mechanism is still not 
fully understood, surface-enhanced Raman 
spectroscopy has become a valuable tool for 
investigating surface chemistry. 

Around 1980, Fleischmann and chemist 
Mark Wightman independently pioneered 
the use of ultramicroelectrodes just a few 
micrometres across, which can be used to 
study electrode processes that are otherwise 
inaccessible, for example at low electrolyte 
concentrations. In 1985, Fleischmann was 
elected a fellow of Britain’s Royal Society. 


The cold fusion experiments arose out 
of Fleischmann’s long-standing interest in 
hydrogen surface chemistry on palladium. 
Hydrogen molecules adsorbed onto palla- 
dium can diffuse into the metal lattice, mak- 
ing palladium a ‘sponge’ that soaks up large 


amounts of hydrogen. Very high pressures 
can build up — perhaps, Fleischmann spec- 
ulated, high enough to fuse hydrogen nuclei. 

Fleischmann’s retirement from Southamp-
ton in 1983 freed him to conduct self-funded 
experiments at the University of Utah in Salt 
Lake City with his former student Stanley 
Pons. They electrolysed solutions of lithium 
deuteroxide, collecting deuterium at the 
palladium cathode, and claimed to measure 
more heat output than the energy fed in — 
a signature, they said, of deuterium fusion 
within the electrode. One morning, they 
found that apparatus left running overnight 
had been vaporized and the fume cupboard 
destroyed. They believed it was the result of 
a violent outburst of fusion. 

Not until 1989 did Fleischmann, Pons 
and their student Marvin Hawkins make 
a move to publish their data. Finding that 
they were in competition with a team led 
by physicist Steven Jones at Brigham Young 
University in Provo, Utah, Fleischmann and 
Pons initially accused Jones of stealing their 
ideas. But the groups agreed to coordinate 
their announcements and to submit papers 
simultaneously to Nature on 24 March 1989. 
Yet Fleischmann and Pons pre-empted that 
arrangement, rushing a second paper to 
the Journal of Electroanalytical Chemistry,
organizing a press conference on 23 March 
and faxing their manuscript to Nature the 
same day without telling Jones. 

The rest, as they say, is history, told for 
example in Frank Close’s Too Hot To Handle 
(W.H. Allen, 1990). Fleischmann and 
Pons’s announcement shocked the world. 
Chemists had apparently, at minuscule 
expense, solved the fusion problem that 
physicists had been working on for dec- 
ades. In the attendant flurry, Fleischmann 
and Pons professed to be too busy to 
address reviewers’ comments and with- 
drew their Nature paper; Jones's account 
was eventually published (S. E. Jones et al. 
Nature 338, 737-740; 1989). Despite 
sporadic claims to the contrary, no 
comprehensive attempt at replication 
produced any confirmation of fusion. 

Indeed, it was a lack of reproducibility 
that finally put paid to the cold fusion 
idea. More bad behaviour followed: Fleis- 
chmann refused to describe crucial con- 
trol experiments; Pons's lawyer threatened 
to sue a Utah physicist who reported in 
Nature (see M. H. Salamon et al. Nature 
344, 401-405; 1990) that he was unable 
to replicate the work. The University of 
Utah sought to capitalize on events, throw- 
ing US$5 million at a ‘National Cold Fusion 
Institute’ that closed two years after it opened. 

Fleischmann and Pons moved to France 
to continue their work with private fund- 
ing, but later fell out. The biggest casualty 
of cold fusion was electrochemistry itself, 
suddenly seeming to be exposed as a morass 
of charlatanism and poor technique. That 
was unfair: some of the most authoritative 
(negative) attempts to replicate the results 
were conducted by electrochemists. 

Fleischmann’s tragedy was Shakespearean,
not least because he was a sympathetic char- 
acter: resourceful, energetic, inventive and 
remembered warmly by collaborators. As 
Linus Pauling and Fred Hoyle experienced, 
once you have been proved right against the 
odds, it becomes harder to accept the possi- 
bility of error. To make a mistake or a prema- 
ture claim, even to fall prey to self-deception, 
is a risk any scientist runs. The test is how 
one deals with it.


Philip Ball is a writer based in London and 
was a physical-sciences editor at Nature at 
the time of the cold fusion publications. 
e-mail: p.ball@btinternet.com 
A single cog in a
complex machine 


If they remain vigilant, early-career researchers can reap 
benefits from taking part in big international projects. 


BY SARAH KELLOGG 


Catherine Luria has little doubt about
the benefits of participating in a big,
international collaboration. Luria, a
marine microbiologist beginning her third 
year of graduate study at Brown University 
in Providence, Rhode Island, is examining 
how changes in sea-ice coverage and blooms 
of phytoplankton affect bacterial diversity 
from season to season. She has literally gone 
to the ends of the Earth to join a collaboration: 
the Palmer Antarctica Long Term Ecological 
Research (LTER) project on the western coast 
of the Antarctic Peninsula. 

Luria will return to Antarctica this month, 
and several more times over the next two years, 
taking a week to travel there to spend two 
months with about 25 researchers and another 
dozen support staff involved in LTER. While 
there, she will characterize the water column, 
collect water samples and measure bacterial 
and phytoplankton abundance and bacterial 
production in the lab. She will examine micro- 
bial growth rates, physiology and community 
composition under different conditions. 

“It's a huge networking opportunity at this 
stage in my career,” says Luria. Thanks to the
collaboration, she will be able to work with 
many more measurements than she would have 
on her own. “What has proved to be especially 
helpful is having access to data,” she says. “Sud-
denly I'm able to dip into this pool of high-qual- 
ity, curated data going back a decade or more. I 
have the ability to get more meaningful results. 
It’s not data from a snapshot of when your grant 
just happened to be funded.” 

High-profile international research projects 
can bring together hundreds, if not thousands,
of scientists. Joining one is no guarantee of pro- 
fessional success for an early-career researcher, 
but it does provide an exceptional environment 
for learning, and access to crucial data and 
networking opportunities that can advance 
personal research and open professional doors. 

Team science practised on a huge scale not 
only yields ground-breaking results, but can 
also establish and fortify careers, as research- 
ers have found in ventures such as the Human 
Genome Project; the ATLAS particle-physics 
experiment at the Large Hadron Collider at 
CERN, Europe's particle-physics laboratory
near Geneva in Switzerland; or the Encyclopedia
of DNA Elements (ENCODE), a project
to define the functional elements of the human 
genome (see page 45). 

“Research careers are built inside big 
collaborations, and a measure of the success of 
university-based research groups is the num- 
ber and quality of prominent positions within 
collaborations that are held by the group’s 
members,” says Ricardo Gonçalo, a particle
physicist at Royal Holloway, University of Lon- 
don, who has worked on ATLAS. 

Positions in big consortia are highly sought 
after, but drawbacks to participating include 
limited access to principal investigators; 
constant jockeying for recognition; the pressure
to subjugate personal research to elevate
project research; the risk of getting lost in
long lists of authors on publications;
and the difficulty of distinguishing individual
work from group work. Having so many people in a
project “implies a lot of politics, different
ways of behaviour that affect our interaction,
many rules”, says Patricia Conde Muíño, a
physicist at the Laboratory of Instrumenta- 
tion and Experimental Particle Physics in Lis- 
bon, who worked on HERA-B, an experiment 
at the DESY particle accelerator in Hamburg, 
Germany, that included 32 institutes and 250 
collaborators from 13 countries. “One thing 
that sometimes is complicated is the internal 
competition. This is stronger in the physics 
groups, where there are literally hundreds of 
people trying to do the same thing as you,” she
adds. 


VALUE ADDED 

Veterans of consortia say that it is crucial for 
young scientists to consult experienced inves- 
tigators when considering whether to join 
a project. They should weigh their research 
objectives and career goals, and assess how 
their strengths and weaknesses might be 
elevated or strained on a high-profile project. 
Although it is impossible to know how indi- 
vidual graduate students or postdocs will fare 
in such intense environments, it is important 
for them to go into projects with their eyes 
open to potential challenges. Those who don't 
proactively seek to develop their skills and net- 
work with established researchers may end up 
being little more than Anonymous Author 
Number 16 on a 40-author publication.

The search for a high-profile collabora- 
tion begins most effectively with a review of 
personal career goals and how best to achieve 
them. Large consortia often represent just one 
step on a long career path. Young scientists 
can use self-assessment tools and resources to 
look at their core competencies and to evaluate
long-term goals to see how a large project
could match their aims. 

Armed with this knowledge, postdocs 
should talk to trusted faculty members or 
mentors, and seek out scientists from the col- 
laboration who are speaking or presenting 
posters at conferences. These people can alert 
the young scientist to research opportunities 
and provide key contacts to enable them to 
visit labs and meet principal investigators. 
The aim is to find the project that best fits the 
young researcher's professional interests and 
personal circumstances, and networking is the 
most efficient way to do that (see ‘Look before
you leap’).

Joining a high-profile collaboration opens 
the door to research and colleagues that may 
previously have been out of reach, while also 
providing the rare opportunity to explore 
cutting-edge research in a competitive and 
well-funded environment. The intimate col- 
laborations of smaller-group research are lost, 
but access to international experts gives young 
researchers great opportunities at this crucial 
time in their careers. 

High-profile projects also give researchers 
a chance to learn new methods and processes 
from international colleagues who bring very 
different approaches to the scientific enter- 
prise. “I think this brings enrichment, and 
hopefully you're able to pick the best from 
each and have a more powerful research team,” 
says Teresa Fonseca Martin, a former particle 
physicist who spent seven years at ATLAS (she 
left this year to become a school teacher). “It is 
true that different cultures have different ways
of working, but by paying a bit of attention, it is
easy to learn about it and work with it.”

Many of these opportunities involve learn- 
ing softer skills, such as professional etiquette, 
leadership and management, communication 
and networking, and how research is con- 
ducted. These can be important for junior 
researchers who may be operating outside 
their home country for the first time and have 
had little contact with scientists from other 
countries. International consortia, says Fon- 
seca Martin, also provide opportunities to 
develop a global network of colleagues and 
friends, as well as a chance to learn about the 
cultures of different countries. 


FIGHTING ANONYMITY 

A big, prestigious team-science effort can not 
only boost a career but also sink one — or at 
the very least, waste the time of an early-career 
researcher. In particular, benefits can be offset 
by a numbing anonymity, especially for par- 
ticipants on the lowest rungs of the research 
ladder. The number of institutes and individual 
scientists involved turns large consortia into 
complex ecosystems that must be negotiated, 
whether researchers are trying to get credit for 
their lab work or attempting to stand out in 
a long list of names on a publication. Indeed, 
Ewan Birney, ENCODE’s analysis coordinator
at the European Bioinformatics Institute
in Hinxton, UK, argues that the aims of indi- 
vidual participants in ENCODE and other big 
collaborations shift, from striving for excellent 
science that leads to publication and career 
success, to striving for maximum data output
in the hope of contributing as much as possible
to a community resource, usually a
big data set (see page 49).


WHAT TO EXPECT


Look before you leap


Early-career researchers in high-profile,
international projects often struggle with
how to stand out in a crowded field of
graduate students and postdocs. Here are
some tips to consider before, and after,
joining a large project team.

• Seek advice about potential projects and
principal investigators from knowledgeable
advisers and researchers who have been
associated with similar projects.

• Assess your personal and professional
interest in the research, including whether
the project will advance your career.

• Review the potential laboratory and
research locations.

• Find out whether the principal
investigator provides the mentorship and
support you want.

• Seek opportunities for first-authorship on
your own work within the project by carving
out a special niche in the research.

• Look for chances to co-author
publications with the principal investigator.

• Volunteer to take on administrative tasks
for the project, such as writing papers,
assisting in interviews and arranging
meetings. This will help you to get your name
recognized and acquire leadership skills.

• Accept opportunities to co-supervise
PhD students with the principal investigator
on discrete research projects within the
collaboration.

• Try to discover something new in the
research or to use a new technique that
advances the project.

• Schedule regular meetings with the
principal investigator to update him or her
on any progress.

• Build good relationships with other early-career
researchers on the project and set
up meetings or web seminars to exchange
information about their research.

• Look for chances to work at the main
project site as well as at your home
laboratory to raise your profile with senior
investigators. S.K.


N. MURGAI


Marine microbiologist Catherine Luria is part of
a major ecology consortium in Antarctica.

“Certainly there’s an allure to a big pro- 
ject, but there’s also a clear career risk of 
being lost in a very large crowd,” says Julie 
Klein, who studies interdisciplinary teams at 
Wayne State University in Detroit, Michigan. 
“These are incredibly exciting and important 
projects, and they’re seen as the future of 
science by some.” They are also massive and
unruly, she adds, in terms of the competition 
for attention. “It is often difficult to find one’s 
place in a collaboration of 3,000 scientists,” 
agrees Gonçalo. “At first it seems that every
good idea you come up with has already been 
tried by someone else.” 


STAND OUT IN THE CROWD 
Nearly ten years after starting work on 
ENCODE, Jason Lieb, a biologist at the Uni- 
versity of North Carolina at Chapel Hill and 
director of the Carolina Center for Genome 
Sciences at the university, says that standing 
out in a large team often means taking on 
extra work. He recommends that new mem- 
bers of the team improve their standing with 
the principal investigator by taking on extra 
roles, such as assisting in writing papers, hir- 
ing graduate students and scheduling group 
activities, and perhaps splitting their time 
between the large project and a smaller one 
in their home lab, with the aim of writing an 
independent paper with the principal inves- 
tigator. Experienced postdocs say that devel- 
oping leadership skills also helps a researcher 
to get noticed. 

Another potential downside for the young 
scientist is the administrative effort required 
to operate these vast projects. For example,
the scale of ATLAS, which includes about
3,000 physicists, has resulted in the develop- 
ment of an unhealthy, sluggish bureaucracy, 
says Fonseca Martin. These projects “don't 
necessarily get the best out of the people, 
and they sometimes make difficult the rec- 
ognition of people’s achievements and con- 
tributions,” she adds, referring to assigning 
authorship and opportunities for promotion. 
Sometimes, says Fonseca Martin, a research- 
er’s management abilities can become more 
important than their scientific ones. 

Major collaborations often require much 
logistical effort, such as organizing meetings 
and conferences, notes Lieb. “People are 
tasked with certain jobs, and there's often a 
chance to take leadership positions in these 
jobs. If you're willing to try that, it’s a good 
way to cut your teeth on a project.” He adds 
that those who have taken on and performed 
effectively in such positions can demonstrate 
to their institute or university that they are 
team players who could, for example, make 
contributions to administrative tasks as ten- 
ured faculty members. 

Along with taking on extra tasks, research- 
ers can increase their profile by visiting and 
working in other labs involved in the collabo- 
ration. This helps them to build contacts and 
disseminate their research widely. “Projects 
that do better have postdocs or graduate stu- 
dents spend two or three months working in 
a lab at another site and then go back to their 
home institution,” says Jonathon Cummings, 
who studies scientific collaboration at Duke 
University’s business school in Durham, 
North Carolina. 

But some researchers caution that gradu- 
ate students and postdocs should be wary 
of becoming too closely associated with a 
single project, however glamorous, in case 
they become pigeonholed by peers and 
potential employers. “I worry that I’ll be
viewed as the ‘person who works in Antarc- 
tica’ and that will shape what I do later on,” 
says Luria. “People are so interested in the 
place and fascinated by what we're doing, so 
it would be easy as a young scientist to have 
this experience become the defining quality 
of my work. I’m loving being in Antarctica 
and being a part of this project, but I’m trying 
hard to make sure it doesn't define me for the 
rest of my career.” 

Getting involved in a high-profile consor- 
tium can indeed be a headache, but it is often 
worth the effort, says Lieb. “People complain 
that these consortia are very clubby and dif-
ficult to get into,” he says. “It’s kind of true,
but there's a reason why it’s true. Once you've 
done it, you're more qualified to do it again. 
If you're able to get in early and demonstrate
your skill at working on a project of this size,
you're more likely to get another shot.”


Sarah Kellogg is a freelance writer in 
Washington DC. 


6 SEP 


© 2012 Macmillan Publishers Limited. All rights reserved 


EUROPE 


Investment increases 


Research and development (R&D) 
investment by European companies is on 
the rise, according to The 2012 EU Survey 
on R&D Investment Business Trends, a
European Commission report released 
on 20 August. The survey of 1,000 large 
companies across all sectors predicts an 
average R&D boost of 4% a year until 
2014. Chemical companies project an 
increase of 5.5%, and oil and gas producers 
4.6%. “Employment costs are more than 
half of total R&D costs,” says Alexander
Tübke at the Institute for Prospective
Technological Studies in Seville, Spain, a 
co-author of the report, “so an important 
share of R&D increases should translate 
into new employment.” But, Tübke notes,
any resulting researcher recruitment is 
likely to be in countries with lower labour 
costs, such as India and China. 


EDUCATION 


Teachers lack resources 


Full- and part-time teaching faculty 
members without tenure at US academic 
institutions face challenges that detract 
from their work and negatively affect 
their students, says a report released on
23 August by the New Faculty Majority
Foundation in Akron, Ohio. A survey of 
500 contingent faculty members found 
that they often don't know until days 
before a class begins that they are to
teach it, and that most have no access to
office or lab space, phones or computers. 
Such practices compromise students’ 
educational experience, the report argues. 
Maria Maisto, executive director of the 
foundation, adds that uncertainty and lack 
of office space also hinder development of 
student-mentor relationships. 


ENTREPRENEURSHIP 
Advice for protégés 


To benefit from mentoring, fledgling 
entrepreneurs should be honest with their 
advisers about business issues such as cash 
flow; seek out mentors with similar values, 
personality or interests; and develop trust 
through frequent meetings, says a study 
based on a survey of almost 400 protégés
(E. St-Jean Int. J. Training Dev. 16, 200–216;
2012). Entrepreneurs who achieve good 
relationships with their mentors can build 
management knowledge and skills and 
improve their visions for their companies, 
says author Etienne St-Jean, who studies 
business management at the University of 
Quebec at Trois-Rivières in Canada.
SCIENCE FICTION


BY TONY BALLANTYNE 


“Doctor,” said Sacha, “can you give
me your assurance that this injection
won't harm my children?”

“Well, there’s always some risk, Ms Mel- 
ham. I do have a leaflet that explains every- 
thing...” 

Sacha placed a finger on the table. 

“I don't need a leaflet, Doctor. I simply 
want your assurance that this injection will 
cause Willow and Gregory no harm...” 

Doctor James Ferriday gazed at the finger. 

“As I said, there is always a small risk, but 
if you look, you will see that this is less than 
the probability of ...” 

Sacha held up her hand. 

“Please, Doctor. Don't try and confuse the
issue.”

“I'm not trying to confuse the issue, I’m
simply presenting you with the facts...” 

Sacha rose to her feet. 

“Well, I think I’ve heard enough. Willow, 
Gregory, put your coats back on. Thank you, 
Doctor, we'll be... what’s that?” 

James's screen flashed red and green. 

“Oh dear,” he said, reading the yellow writ-
ing scrolling across the monitor. “I think you 
should take a seat.” 

Sacha did so. Her son slipped his hand 
into hers. 

“What's the matter, mummy?” 

“Nothing, dear. Is everything OK, Doctor?” 

“I'm sorry, Ms Melham...” he began, and
then more kindly. “I’m sorry, Sacha, but
you've crossed the threshold. I’m afraid to 
say, you're not allowed science any more.” 

“I'm what?”

“You're not allowed science any more,” 
repeated James. 

Sacha’s lips moved as she tried to process 
what he had said. 

“You're saying that you're refusing my 
children treatment?” 

“No,” said James. “Quite the opposite. You
and your children will always be entitled to 
the best medical care. It’s just that you, Sacha, 
no longer have a say in it. I shall administer 
the vaccination immediately.”

“What?” Sacha sat up, eyes burning with 
indignation. “How dare you? I, and my hus- 
band, are the only ones who say how my 
family is run.”

“Well, yes,” said James. “But you no longer
have a say in things where science is involved.
You're not allowed science any more.”

“I never heard anything so ridiculous! 
Who decided that?” 


IF ONLY ... 


A taste of your own medicine. 


“The Universe.” 

“The Universe? Why should the Universe 
say I’m not allowed science any more?” 

“Because you haven't paid science enough 
attention. You've had the opportunity to read 
the facts and the education to be able to ana- 
lyse them, yet you have consistently chosen 
not to.”

“The education?” exclaimed Sacha. “Hah! 
My science education was terrible. None of 
my teachers could explain anything
properly.”

“Really?” said James. “That 
would certainly be grounds for 
appeal...” 

He pressed a couple of 
buttons. Tables of figures 
appeared on the screen. 

“No,” he said, shaking his 
head. “I'm sorry... it turns
out that your teachers were 
all really rather excellent. 
You went to a very good pub-
lic school, after all. If you look at
your teachers’ results you will see 
they added significant value to their 
pupils’ attainment.”

Sacha pouted. 

“Well, they didn’t like me.”

“Possibly...” 

He pressed a couple more buttons. 

“What?” said Sacha, hearing his sharp 
intake of breath. 

“Look at this,” said James, scrolling down 
a long table. “Times and dates of occasions
when you've proudly admitted to not being 
good at maths.” 

“What's the matter with that? I’m not.”

“It’s not the lack of ability, Sacha, it’s the
fact that you're proud of it. You'd never be
proud of being illiterate. Why do you think 
your innumeracy is a cause for celebration?” 

“Because... Well...” 

“That’s why you're not allowed science 
any more.” 

“This is outrageous!” snarled Sacha. “How 
can this happen?” 

“Oh, that’s easy,” said James. “Magic.”

“Magic?” said Sacha, her eyes suddenly 
shining. “You mean there’s really such a 
thing?” 

“Of course not. But I can’t explain to you 
how it's really done because you're not allowed
science any more.”


NATURE.COM
Follow Futures on Facebook at:
go.nature.com/mtoodm

Sacha fumbled for her handbag.

“I'm calling the BBC,” she said. “I'm a
producer there, you know. I’ll report you.”

“Report me to who you like,” said James. 
“The story will never get out. All your cam- 
eras and microphones and things work on 
science.” 

Sacha gazed at him. 

“Who gave you the right to control my 
life?” 

“You've got it the wrong way round. You
gave the right to control your life away.
You're the one who chose to ignore the way
the world works.”

“Hah!” said Sacha. “The way the
world works! Bloody scientists.
You think the world is all
numbers and machines and
levers. You don't understand
anything about the soul or spirit.”

“Of course I do,” said James. “I’ve been
happily married for 20 years. I have two
children that I love. I play the piano, I enjoy
reading. It's just that I have additional
ways of looking at things.”

Sacha stood up. 

“Willow, Gregory. We're going home,” she
glared at James. “That is if I’m still allowed 
to drive? You don’t have something against 
women drivers as well do you, Doctor?” 

“This is nothing to do with you being 
female, Ms Melham,” said James, calmly. 
“This is purely about your attitude to sci- 
ence. Now, before you go, I'll administer the 
injection to the three of you.” 

“You will not! I will not allow it.”

“I told you, you have no choice.”

“Why? Because I disagree with you?” 

For the first time, James’s anger showed
itself. 

“No!” he snapped. “You don't get it! You're
allowed to disagree with me, I want you to
disagree with me! I'd love to engage in rea-
soned debate with you. But until you take
the trouble to understand what you're talk- 
ing about, you're not allowed science any 
more. Now, roll up your sleeve.” 

Sacha muttered something under her 
breath. 

“What's in the injection?” said James. 
“You know, you start asking questions like 
that, you might get science back...”


Tony Ballantyne’s latest collection of tales 
is Stories of the Northern Road (NewCon 
Press). You can find him at tonyballantyne. 
wordpress.com. 
MATERIALS SCIENCE


A hard concept in soft matter 


Hydrogels have many potential applications, but their mechanical strength is low. By simultaneously crosslinking 
two kinds of polymers in different ways, a highly fracture-resistant hydrogel has been made. SEE LETTER P.133 


KENNETH R. SHULL 


The stress required to break a pristine
piece of standard window glass is
much larger than that required to
break a polymer-based acrylic window, yet 
the acrylic window has a far better chance of 
surviving an impact with an errant baseball. 
In fact, the appropriate measure of fracture 
resistance is not fracture stress, but fracture 
energy — impact-resistant glass is designed 
so that the kinetic energy of a baseball is not 
sufficient to cause catastrophic breakage of 
the window. Although this concept has been 
applied quantitatively to relatively stiff mater- 
ials such as glass, ceramics and metals, our 
mechanistic understanding of the fracture of 
soft, highly extensible materials is much more 
limited. On page 133 of this issue, Sun et al.¹
not only address this issue, but also report a 
highly extensible material that has remarkable 
mechanical toughness. 

The material described by the authors is a 
hydrogel. Broadly, hydrogels are solutions of a 
polymer in water, in which the polymer mol- 
ecules are crosslinked to one another so that 
the material can support a mechanical load. 
Because hydrogels consist primarily of water, 
the concentration of the load-bearing crosslinks 
is low, and so the mechanical strength of hydro- 
gels is typically also low. Hydrogels are therefore 
commonly used in applications in which they 
are not placed under substantial mechanical 
stress — such as in drug delivery and tissue 
engineering, in which the role of the gel is to 
control the distribution of cells or molecular 
species. Although a solid material is needed for 
these applications, a material strength of about 
1 kilopascal (corresponding to a 10-gram load 
distributed over an area of 1 square centimetre) 
is more than sufficient. 
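
The conversion quoted above, from a load spread over an area to a strength in pascals, can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the use of standard gravity (9.81 m/s²) are assumptions made here, not details taken from the paper.

```python
# Illustrative sketch: convert a mass resting on an area into a pressure.
# The helper name and the value of G are assumptions for this example.

G = 9.81  # standard gravitational acceleration, m/s^2


def load_to_pressure_pa(mass_grams: float, area_cm2: float) -> float:
    """Return the pressure in pascals exerted by a mass spread over an area."""
    force_newtons = (mass_grams / 1000.0) * G   # grams -> kilograms -> newtons
    area_m2 = area_cm2 * 1e-4                   # square centimetres -> square metres
    return force_newtons / area_m2


if __name__ == "__main__":
    # A 10-gram load over 1 square centimetre comes to roughly 1 kilopascal.
    soft_gel_pa = load_to_pressure_pa(10, 1)
    print(f"10 g over 1 cm^2 is about {soft_gel_pa / 1000:.2f} kPa")
```

Keeping the unit conversions explicit makes it easy to repeat the estimate for heavier loads or larger load-bearing areas.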

The development of much tougher hydro- 
gels, however, would enable a host of other 
applications to be considered. For example, if 
hydrogels could be made to withstand physi- 
ologically relevant loads, corresponding to 
about 10⁵ grams over a load-bearing area of a
few square centimetres (ref. 2), then it would 
be possible to prepare materials that mimic the 
behaviour of cartilage. Such materials would 
require a compressive strength in the range 
of several megapascals, and a corresponding 


[Figure 1 panels: a, polyacrylamide gel; b, alginate gel; c, hybrid gel.]


Figure 1 | Energy dissipation in hydrogels. When a hydrogel with a notch in it is stretched, crack 
propagation depends on the polymer chains in the gel. a, For covalently crosslinked polymers such as 
polyacrylamide (green squares indicate crosslinks), the chains ahead of the notch need to break. b, In 
alginate gels — in which calcium ions (red) crosslink binding sites in different chains — the crosslinks 
ahead of the notch need to break. In both a and b, the area over which energy is dissipated (pink region) 
is small, and so cracks propagate easily. c, Sun et al.¹ report hydrogels that contain crosslinked mixtures
of polyacrylamide and alginate (triangles represent crosslinks between different polymer types). They 
propose that, on stretching, many non-covalent alginate crosslinks break in a wide zone around the head 
of the notch. Energy is therefore dissipated across a large area, and so the polyacrylamide chains do not 
break. This makes the gel extremely resistant to crack propagation. 


fracture toughness that is much larger than 
any traditional synthetic hydrogel. For a long 
time, fully synthetic hydrogels with this kind 
of mechanical strength simply did not exist. 
This picture changed significantly in 2003, 
with the introduction of a set of ‘double 
network’ hydrogels’ that have compressive 
strengths as high as 20 megapascals, and cor- 
responding fracture energies’ up to 700 joules 
per square metre — about 100 times larger 
than the fracture toughness of a typical hydro- 
gel. These materials are based on two inter- 
penetrating, crosslinked polymer networks. 
The first network has a fairly high density of 
covalent crosslinks, giving the gels a modulus 
(a measure of the material’s elastic stiffness) in 
the megapascal range. This primary network 
is quite brittle, and it fractures at low applied 
strains (strain is a measure of the extent to 
which an object has been deformed by a stress).
The second network has a much lower 
density of covalent crosslinks than the primary 
network, and does not contribute substantially 
to the hydrogel’s mechanical properties at 
low applied strains. But at large strains, such 
as those encountered in front of a crack that 
is propagating through the gel, the loosely 
crosslinked secondary network distributes
stress across a relatively narrow ‘damage zone’.
Energy is dissipated as the primary network is 
broken into small fragments within this zone, 
so that the fracture energy of the double net- 
work is enormously larger than that of either of 
the corresponding single networks. 

In the previously reported double-network 
gels”’, the fracture energy is dissipated irre- 
versibly as covalent bonds are broken within 
the damage zone. The materials therefore 
have excellent fracture toughness, but very 
poor fatigue life — they behave extremely 
well during initial compression, but after a 
subsequent compression the fracture energy 
is greatly diminished. This limitation is 
now addressed by Sun and colleagues. The 
authors replaced the covalently crosslinked 
primary network with an ‘alginate’ net- 
work that forms non-covalent crosslinks in 
the presence of calcium ions (Fig. 1). These 
calcium-based crosslinks form and break 
reversibly, so that much of the energy that 
is dissipated when the material is deformed 
is recoverable. The resulting materials 
can also be deformed to large strains and yet 
still retain a high fracture energy. Impressively, 


some of the gels can be stretched to up to 
20 times their original length before fracture 
occurs, and have corresponding fracture 
energies of about 9,000 joules per square metre. 

Sun and colleagues’ materials are noteworthy 
for three reasons. First, they represent an 
important extension of the double-network 
concept, and greatly enhance the maximum 
extension, fracture energy and retention 
of material properties of hydrogels during 
multiple loading cycles. Second, the materi- 
als are relatively easy to synthesize compared 
with previously reported tough hydrogels. And 
finally, these systems are excellent models for 
investigating fundamental issues of the fracture 
behaviour of soft, highly deformable materials. 

The Supplementary Information to the 
paper is full of experimental details that are 
relevant to this third point. One of the most 
intriguing results is the implication that, when 
an existing crack first propagates, energy dis- 
sipation is confined to a damage zone that is 
much smaller than the overall sample size. 


It is also clear, however, that the new materials
can be deformed in such a way that substan-
tial energy is dissipated throughout the entire
material before any crack propagation. Addi-
tional experiments are needed to sort out the
details of energy dissipation in these materials,
and the relationship between energy dissipa-
tion and crack propagation that forms the core
of any investigation of material fracture.

A thorough answer to some of these ques-
tions will require a better understanding of
the molecular structure of the materials. Sun
et al. show that the two polymer networks are
most probably covalently linked to one another
(Fig. 1). The synthesis of materials that do not
contain such inter-network links, or in which
such links can be introduced at a quantifiable
level, is an obvious next step to refine molec-
ular-level models of fracture in soft, highly
deformable materials.

Conceptually, the design principles used
to produce toughened, ‘hard’ materials such
as window glass are the same as those used
to produce tough but ‘soft’ materials such as
polymer gels. Material-specific details matter,
however, and different methods are needed
to understand the toughening mechanisms
in different material classes. Sun and col-
leagues’ gels will certainly motivate continued
research by those interested in the mechanical
properties of soft materials. The authors have
provided some valuable answers about the
properties that materials can possess, while at
the same time generating a variety of questions
for soft-matter scientists to ponder.


COSMOLOGY


The lithium problem


The theory that predicts how the lightest elements formed after the Big Bang has
hitherto failed to explain the amount of cosmic lithium. The detection of interstellar
lithium beyond the Milky Way gives this theory a boost. SEE LETTER P.121


GARIK ISRAELIAN


Our knowledge of the abundances of
light elements, such as hydrogen,
helium and lithium, in the early
Universe has relied on measurements of the
chemical content of the atmospheres of old
stars in the Milky Way’s halo. These observa-
tions have long puzzled astronomers because
they are in partial disagreement with theo-
retical predictions, which are based on the Big
Bang nucleosynthesis theory and on a precise
determination of the cosmic ratio of baryons
(particles such as protons and neutrons) to
photons. The measured ‘primordial’ amounts
of hydrogen and helium match the predic-
tions, but that of lithium does not. Elsewhere
in this issue, Howk and colleagues¹ (page 121)
report a measurement of the abundance of the
lithium-7 isotope in the interstellar medium
of the Small Magellanic Cloud, a dwarf galaxy
neighbouring the Milky Way, that is in accord
with the Big Bang nucleosynthesis theory.

The nuclei of hydrogen, helium and lithium
were created when the Universe was between
2 and 5 minutes old, after the hot primordial
plasma had cooled sufficiently for protons and
neutrons to form’. However, the abundance of
lithium is billions of times lower than that of
hydrogen and helium. This is because lithium
is more prone to being destroyed in stars than
hydrogen and helium are, and there are not
many processes by which lithium is produced.

Astronomers have long thought that the
primordial abundance of lithium is preserved
in our Galaxy’s stars that are especially old
and comparatively cool. Stars have a layered
structure. Nuclear-fusion reactions take place
in the stars’ inner (and hotter) regions but
not in their outermost layers. Therefore, the
composition of the outermost layers should 
indicate the chemical content of the matter 
from which a star has formed. For very old 
stars, such surface chemical abundances should 
be close to the primordial values. For younger 
stars, which formed from material that con- 
tained the nuclear-fusion products of previous 
generations of stars, the surface abundances 
should be different. 

To test the theoretical predictions of the 
Big Bang nucleosynthesis (BBN) theory, 
we need to identify astronomical objects in 
which the primordial abundance values are 
preserved as much as possible, and we need 
to account for any remaining influences 
of chemical evolution. In the early 1980s,
astronomers discovered’ that old, dwarf stars
in our Galaxy, Sun-like stars that are poor
in metals (elements other than hydrogen and
helium), share the same lithium abundance
irrespective of their temperature and metal
content. This ‘plateau’ was readily inter-
preted as evidence that the constant lithium
abundance was primordial.

Figure 1 | The Small Magellanic Cloud. Howk et al.¹ find that the amount of interstellar lithium in the
Milky Way’s neighbouring Small Magellanic Cloud galaxy is in agreement with the predictions of the Big
Bang nucleosynthesis theory. The galaxy is seen here in infrared light collected by the Herschel Space
Observatory and the Spitzer Space Telescope.

Using observations of the cosmic micro- 
wave background’ (relic radiation from the Big 
Bang) obtained by the Wilkinson Microwave 
Anisotropy Probe satellite, researchers have 
been able to make an accurate measurement 
of the cosmic ratio of baryons to photons. 
Combined with this measurement, the BBN 
theory predicts an abundance of the lithium-7 
isotope that is about four times that inferred 
from measurements of old, metal-poor stars 
in the Milky Way’s halo’. This mismatch con- 
stitutes the ‘lithium problem’. The solution
to this problem can be sought either by con- 
sidering modifications to the BBN theory, or 
by identifying processes by which lithium is 
destroyed in old, metal-poor halo stars so as 
to cause the primordial lithium abundance to 
have evolved, over the stars’ lifetimes, to the 
observed plateau. 

An alternative route for tackling the lithium 
problem — and the one adopted by Howk and 
colleagues in their study — is to determine the 
lithium abundance of metal-poor interstellar 
gas. This approach is unaffected by those pro- 
cesses that can alter the chemical content of 
stellar atmospheres over time. Howk et al. 
obtained high-quality spectroscopic obser- 
vations of the lithium spectral line in the 
metal-poor gas of the Small Magellanic Cloud 
(Fig. 1). They then derived the total lithium 
abundance in the galaxy’s interstellar medium. 
This derivation is a difficult task. It requires 
knowledge of the ionization fraction of 
lithium and an accurate determination of the 
amount of lithium locked in interstellar dust 
grains. The authors used several approaches 
to account for and measure these quantities. 
They found that the present-day abundance 
of interstellar lithium in the Small Magellanic 
Cloud is almost equal to the BBN predictions. 

Howk and colleagues’ results are therefore 
good news for BBN theory. But how can the 
stellar observations be brought into agree- 
ment with the theory? There are many mecha- 
nisms that can destroy lithium in stars, all of 
which imply that the material is processed at 
temperatures exceeding 2.5 million kelvin. It is, 
however, difficult to argue that the same mech- 
anism can account for all of the stars that are 
depleted of lithium. The latest observations” 
of lithium in metal-poor stars in the Galactic 
halo show a ‘meltdown’ of the lithium plateau
for low metal abundances, such that lithium 
depletion increases with reduced metal abun- 
dance. However, some stars do not follow this 
trend, and remain on the plateau. This implies 
that the physics of lithium depletion in metal- 
poor Sun-like stars is not properly understood. 

Magnetic activity, or the presence of a com- 
panion star or a giant exoplanet®, can modify the 
surface abundance of lithium in Sun-like stars. 


However, it remains to be investigated whether 
these factors can explain the lithium content 
of metal-poor Galactic-halo stars. There are 
several unanswered questions, but Howk 
et al. provide the first convincing evidence 
that the lithium abundance in Galactic 
metal-poor stars is not primordial.
Lessons from heartbreak


Male fruitflies quickly learn that courting already-mated females is useless. It 
turns out that a small subset of neurons in the male brain signals this negative 
experience and controls pheromone sensitivity. SEE LETTER P.145 


AKI EJIMA 


Animals make behavioural decisions on
the basis of their prediction of the con-
sequences, and they learn from experi-
ence so that they are better prepared for future
events. On page 145 of this issue, Keleman
et al. describe how female rejection enhances
the ability of males of the fruitfly Drosophila 
melanogaster to identify a promising mating 
partner later on. 

Courtship by male fruitflies is largely an 
innate process: the decision to court or not to 
court depends on the potential mate's scent. 
A mature virgin female releases aphrodisiac 
pheromones that trigger the male's courtship, 
whereas males produce other pheromones that 
inhibit such behaviour. Moreover, a previously 
mated female carries some male scent from 
her previous mating, in addition to her own 
aphrodisiac pheromones. As a result, naive 
males court mated females with less enthusi- 
asm than they court virgin females. Neverthe- 
less, once a decision to court is made, the male 
performs an elaborate courtship ritual (Fig. 1) 
with no previous instructive experience’. 

There is, however, one aspect of courtship 
behaviour that can be influenced by experi- 
ence. In 1979, Siegel and Hall’ reported that 
exposing a male fruitfly to a mated female 
(training) led to suppression of the male’s 
subsequent courtship towards a virgin female 
(test). Mated females reject courting males 
because of the influence of sex peptide (SP), a 
component of the seminal fluid transferred by 
the previous male during copulation. Rejected 
males associate the unsuccessful courtship 
experience with the female’s aphrodisiac 
pheromones, and thus suppress their response 
to a virgin. This experience-dependent 
behavioural modification is called courtship
conditioning and has served as one of the 
major experimental paradigms for the study 
of associative learning in Drosophila. 

It has become evident, however, that court- 
ship conditioning can follow associative or 
non-associative mechanisms depending on the 
nature of the trainer and tester females’. For 
example, it was unclear whether the enhanced 
courtship suppression produced by repeat- 
edly exposing a male to mated females (at 
both training and test) was based on associa- 
tive learning. Because the male is exposed to 
the same stimuli throughout training and test, 
the enhanced courtship suppression could be 
the result of sensory sensitization to negative 
signals from the mated female. 

In fact, a mated female provides two kinds 
of negative signals to courting males: SP- 
induced rejection behaviour together with 
male-derived pheromones such as cis-vaccenyl 
acetate (cVA). To dissociate the effects of these 
two signal types, Keleman and colleagues used 
‘pseudomated’ females, that is, virgin females 
that express the SP-encoding gene and there- 
fore reject courting males, but lack cVA phero- 
mone. The authors also allowed SP-deficient 
males to mate with normal females, which thus 
became ‘pseudovirgins’ — mated females that 
remain receptive (because of the lack of SP) 
but possess cVA. The authors found that using 
pseudomated females for training and pseudo- 
virgins for test resulted in the same levels of 
courtship suppression as using genuine mated 
females for both training and test. This result 
indicates that enhanced courtship suppres- 
sion is a non-associative behavioural modifi- 
cation: a failed copulation attempt (caused by 
the female's rejection behaviour) enhances the 
male’s sensitivity to cVA, which, in turn, leads
to exaggerated courtship suppression.

Figure 1 | Fruitfly courtship. A male fruitfly (lower) uses his wings in a ritual courtship display to a female.

The authors also demonstrated that artificial 
activation of a specific small subset of dopa- 
minergic neurons — which use dopamine as 
a neurotransmitter molecule — in the male's 
brain mimicked courtship training and modi- 
fied the male's sensitivity to the pheromone. In 
the fruitfly, dopaminergic neurons are known 
to have roles in associative learning of tasks 
linked to odours*®, and in male–male court-
ship conditioning’ (mature male fruitflies
court immature males when first exposed to 


them, but this behaviour decreases over time
as a result of experience). Keleman and col-
leagues go one step further by uncovering
the molecular and cellular mechanisms by
which information about a failed courtship
experience is signalled through dopaminer-
gic neurons in the male's brain, and how this
information affects the male’s behavioural
sensitivity to cVA.

Male fruitflies have a strong instinct to
ingratiate themselves with a potential mate, but
the odds are not always on their side. Because
a previously fertilized female will not accept a
second mating for about a week, males need
to know when to pull back. A male-derived
pheromone, cVA, helps them to identify such
unreceptive females. But why is cVA sensitiv-
ity so low in naive males? The authors’ finding
that courtship training enhances pheromone
sensitivity suggests that the male uses his own
experience as an indicator of future mating
probability. This could help the male to opti-
mize his mating strategy in time and space.
It would be interesting to see whether the
opposite is also true: does a successful mating
experience decrease the male’s sensitivity to
cVA and, therefore, increase the fly’s ‘sexual
confidence’?


CLIMATE CHANGE


Brief but warm
Antarctic summer


A temperature record derived from measurements of an ice core drilled on
James Ross Island, Antarctica, prompts a rethink of what has triggered the
recent warming trends on the Antarctic Peninsula. SEE LETTER P.141


ERIC J. STEIG


In 1842, James Clark Ross sailed past James
Ross Island, on the eastern side of the
Antarctic Peninsula, and named its high,
glaciated volcanic peak, Mount Haddington’.
In 2008, a team of scientists led by Robert
Mulvaney of the British Antarctic Survey suc-
cessfully drilled an ice core near the summit
of Mount Haddington, reaching bedrock
at 364 metres below the surface. Now, on
page 141 of this issue, Mulvaney and col-
leagues” describe a detailed analysis* of the
ice core that allowed them to produce a long
record of climate change on the Antarctic
Peninsula, one of the fastest-warming regions
on Earth. The record stretches back to at least
20,000 years BP (0 yr BP means AD 1950), and
may extend to about 50,000 BP.

*This article and the paper under discussion’ were
published online on 22 August 2012.

Much has changed on the Antarctic Penin-
sula in the 170 years since Ross’s voyage there.
The most dramatic changes have occurred
in just the past two decades, during which a
number of large ice shelves have collapsed,
altering the geography of the region. Ross’s
transit along the eastern shore of his name-
sake island was apparently blocked by ice at the
southern entrance to Admiralty Sound (Fig. 1).
Only after 1995, with the collapse of the ice
shelf in Prince Gustav Channel between James
Ross Island and the mainland of the Antarctic
Peninsula, did circumnavigation of the island
become possible.

Surface melting during unusually warm
summers has had a critical role in the recent
demise of Antarctic Peninsula ice shelves’,
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and it is natural to relate these melting events 
to anthropogenic global warming*. However, 
most researchers have been reluctant to make 
this connection, in part because the record 
of temperature measurements on the penin- 
sula — as elsewhere in Antarctica — is rela- 
tively short, and the natural decade-to-decade 
variability is large*, making the significance 
of recent warming trends difficult to assess. 
Also, it is only on the eastern margin of the 
Antarctic Peninsula that the summertime 
temperature trends are large. On the western 
side, the greatest warming in the past 50 years 
has occurred in winter and spring’, as it has in 
continental West Antarctica’. These differing 
seasonal trends suggest different underlying 
mechanisms’, and many studies have attrib- 
uted the summer warming on the eastern 
peninsula to atmospheric-circulation change 
associated with the Antarctic ozone hole in the 
stratosphere® (the atmospheric layer immedi- 
ately above the troposphere, the lowest portion 
of the atmosphere). Thus, it has been thought 
that if human agency has played a part in the 
demise of Antarctic Peninsula ice shelves, it 
has primarily been through our destruction 
of stratospheric ozone rather than through the 
increased radiative forcing from greenhouse 
gases in the troposphere. 

Mulvaney et al. provide a much longer record 
of temperature than is available from direct 
instrumental observations. Using the oxygen 
and hydrogen isotope ratios measured on the 
50 Years Ago 


‘Antibiotic activity of various types 
of cannabis resin’ — The differences 
in chemical composition shown 

by various types of cannabis resin 
may be explained by the stage of 
development of a phytochemical 
process by which cannabidiolic 

acid is gradually converted to 
cannabidiol, tetrahydrocannabinols 
and finally to cannabinol ... referred 
to as ‘ripening’ of the resin ... 
According to the results obtained, 
antibiotic activity decreases together 
with the progress of phytochemical 
conversion of cannabinols, that 

is, together with the increase of 
hashish activity. Antibacterial 

agent (cannabidiolic acid) is by the 
ripening process obviously converted 
into hashish-active constituents 
(tetrahydrocannabinols). The 
antibiotically active unripe cannabis 
seems to be more common in 
regions having unfavourable climate, 
whereas tropical samples more often 
correspond to the ripe, hashish- 
active drug. 

A. Radošević, M. Kupinić & Lj. Grlić 
From Nature 8 September 1962 


100 Years Ago 


We are glad to see that progress 

is gradually being made with the 
synchronisation of clocks ... Last year 
a committee of the British Science 
Guild ... recommended that, as a 
beginning, it would probably be well 
to have a few large public clocks in 
London synchronised, and that these 
should be set apart and considered 
as ‘standard time clocks.’ An electric 
clock which may be used for the 
purpose suggested by the committee 
has just been built by the Silent Electric 
Clock Co. ... We understand that this 
electric clock ... is also to be controlled 
by a master clock directly synchronised 
from Greenwich. The clock thus 
represents an up-to-date form of 
public timekeeper which is likely to 
be extensively adopted in the future. 
From Nature 5 September 1912 


Figure 1 | Voyage of discovery. This image from James Clark Ross's Voyage of Discovery’ shows 
Admiralty Sound blocked by ice in 1842. Cockburn Island is shown on the left, with vessels HMS Erebus 
and HMS Terror in the foreground. The edge of James Ross Island is visible on the right. An ice-core 
temperature record’ from the summit of James Ross Island shows that recent warming in this area has 
been unusually rapid. 


ice core as palaeothermometers, the authors 
show that warming began at James Ross Island 
in the 1920s, well before the advent of chloro- 
fluorocarbon production and the development 
of the stratospheric ozone hole. This timing is 
in good agreement with the only long instru- 
mental temperature record available anywhere 
near the Antarctic Peninsula — on the sub- 
Antarctic island of Orcadas, some 1,000 kilo- 
metres to the northeast”. It is also in agreement 
with instrumental records for the Southern 
Hemisphere as a whole, and with the ice-core 
record from the West Antarctic Ice Sheet"®. 
Although temperatures on the Antarctic 
Peninsula comparable to those of the present 
have certainly occurred in the past, the last 
time that century-average temperatures were as 
warm as those of the twentieth to early twenty- 
first centuries was about 2,000 years ago — 
corresponding with evidence from marine 
sediment cores indicating that this was the 
last time Prince Gustav Channel was open”. 
Thus, the growth and decay of Antarctic 
Peninsula ice shelves have followed temperature 
variations over thousands of years. 
Mulvaney and colleagues’ results provide 
evidence that the modern occurrence of excep- 
tionally warm temperatures on the Antarctic 
Peninsula may not be attributable solely either 
to the decline of stratospheric ozone — the 
warming trend begins too early — or to natural 
decadal climate variability. Indeed, one could 
postulate, as a null hypothesis, that warming 
on the Antarctic Peninsula is independent of 
the global-warming trend of the past century. 
However, the rate of recent warming at James 
Ross Island is highly unusual, falling within the 
uppermost 0.3% of all century-scale tempera- 
ture trends of the past two millennia, which 
would compel us to reject the null hypothesis 
with confidence. A caveat is that this conclusion 
applies only to mean annual temperatures; 
obtaining seasonal information from ice cores 
is difficult. These results cannot, therefore, be 
considered definitive evidence for exceptional 
long-term trends in summer temperature. 
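
One way to picture the "uppermost 0.3%" statement is as a rank among sliding-window trends. The Python sketch below is illustrative only: it assumes an annually resolved series, fits an ordinary least-squares slope to every 100-year window, and reports the fraction of windows at least as steep as the most recent one. The synthetic series and the helper names (`ols_slope`, `window_trends`, `fraction_at_least`) are invented here; they are not Mulvaney and colleagues' data or their statistical method.

```python
import random


def ols_slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def window_trends(series, window=100):
    """Slope of every contiguous `window`-year stretch of an annual series."""
    return [ols_slope(series[i:i + window])
            for i in range(len(series) - window + 1)]


def fraction_at_least(trends, target):
    """Fraction of trends that are at least as large as `target`."""
    return sum(t >= target for t in trends) / len(trends)


# Synthetic stand-in for a 2,000-year annual record: noise plus a ramp
# over the final century. These numbers are invented for illustration.
rng = random.Random(0)
series = [rng.gauss(0.0, 0.5) for _ in range(1900)]
series += [rng.gauss(0.0, 0.5) + 1.5 * k / 100 for k in range(100)]

trends = window_trends(series)
recent = trends[-1]
print(f"recent trend: {recent * 100:.2f} degrees per century")
print(f"fraction at least as steep: {fraction_at_least(trends, recent):.4f}")
```

A small fraction here would play the role of the 0.3% figure quoted above. As in the article, a rank computed on annual means says nothing about summer temperatures specifically.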

It does not necessarily follow that current 
warming trends and associated ice-shelf losses 
will continue. A pivotal influence on Antarctic 
Peninsula climate, in addition to the effects of 
greenhouse-gas forcing and ozone changes, are 
the atmospheric-circulation anomalies that 
result from climate changes elsewhere, par- 
ticularly in the tropical Pacific”. How such 
anomalies will evolve in the future is highly 
uncertain’’. Nevertheless, the unusual temper- 
ature increase over the past century suggests 
that relatively modest radiative forcing from 
the global increase in greenhouse gases has had 
a significant effect on the Antarctic Peninsula. 
Continued increases in both mean annual and 
summer temperature on the Antarctic Penin- 
sula are a common feature of projections from 
climate models, given continued increases in 
greenhouse gases“. Mulvaney and colleagues’ 
observations make such projections difficult 
to dismiss. 
Separation by 
reconfiguration 


Membranes have been made that are hygro-responsive — their wetting 
properties change when immersed in water. This striking property allows the 
membrane to separate emulsions into their oil and water constituents. 


ROBERT W. FIELD 


Oil and water don’t mix, so the saying 
goes — unless they form an emulsion, 
in which case it is difficult to get them 
apart. Reporting in Nature Communications, 
Tuteja and colleagues’ describe a simple, scal- 
able method of great potential for separating 
such ‘oily water’ mixtures. They have devel- 
oped membranes whose surfaces are extremely 
repellent to oil, but which allow water to per- 
meate freely when oily water is filtered through 
them, so that the retained liquid is principally 
oil. Unlike mechanical systems such as centri- 
fuges or settling tanks, which separate oil from 
water only if the oil phase is a distinct disper- 
sion of droplets, the authors’ ‘smart’ membranes 
separate emulsions highly efficiently. Such 
hygro-responsive membranes could be devel- 
oped to clean up oil-contaminated sea water. 

Tuteja and colleagues previously reported” 
superoleophobic surfaces — ones that resist 
wetting by liquids that have extremely low 
surface tension, such as oils and alcohols. The 
key to making them was the recognition that 
the surfaces’ texture is crucial for superoleo- 
phobicity. In particular, re-entrant surface cur- 
vature (surfaces that have concave topographic 
features) is required’. So, by making surfaces 
that have an appropriate chemical composi- 
tion, roughened texture and re-entrant surface 
curvature, the authors prepared materials that 
were extremely resistant to wetting by several 
liquids. These surfaces can be thought of as 
omniphobic, because they are highly repellent 
to water as well as to oils. 

More recently, Tuteja’s group went further by 
developing oleophobic membranes’ that sepa- 
rate oily water emulsions when an electric field 
is applied across the membrane. This enabled 
‘on-demand’ separation of millilitres of emulsion, 
but it is questionable whether the system 
could be used at an industrial scale. Although 
electrically enhanced processes” were an active 
research area in the 1980s and 1990s, commer- 
cial developments have not followed because 
scaling up is a problem. 

The membranes now reported by Tuteja and 
colleagues’ are different. The authors describe 
them as hygro-responsive, a word that derives 
from the Greek hygros, which means wet. This 
description is certainly pertinent, because 
wetting of the membranes by water — along 
with wicking and capillary flow — is vital 
for their separation properties. The authors 
prepared their membranes by coating either a 
stainless-steel mesh or a polyester fabric with 
a blend of a polymer and an oligomeric material. 
The resulting non-wetted membranes 
are both superoleophobic and hydrophobic, 
but when they are wetted, molecules at the 
surface of the coating reconfigure in such a way 
as to enable excellent water permeability while 
retaining superoleophobicity. This reconfigu- 
ration could be attained within a few minutes, 
which means that the time taken to ‘activate’ a 
membrane with water will not be a problem in 
industrial applications. 

A similar reconfiguration has been observed 
at the surfaces of other polymer films, such 
as poly(methyl methacrylate), for which 
the relationship between molecular surface 
rearrangement and wettability has been well 
characterized’. It has also been noted’ that 
surfaces that have been chemically modified 
by the attachment of amphiphilic macro- 
molecules (polymers that have both hydro- 
philic and hydrophobic properties) can lead 
to ‘switchable wetting’, in which the surface’s 
wetting properties change depending on the 
properties of fluids to which they are exposed. 
By taking these materials through several 
wetting and drying cycles with water, it was 
shown that surface reconfiguration in these 
systems is reversible. 

Figure 1 | Flow scheme for separating oil-water emulsions. The scheme depicts how Tuteja and 
colleagues’ hygro-responsive membranes’ might be used in a flow system for separating the constituents 
of oil-water emulsions. a, The oily water is fed into a vessel where it filters through a membrane. 
Essentially pure water passes through. b, The membrane also causes tiny droplets of oil in the emulsion 
to coalesce at its surface. Once large enough, these rise within the oily water. c, A suspension of the large 
droplets in water is passed into a separate chamber, where the droplets float to the surface and form a 
separate layer of oil. d, The underlying water layer, which still contains a little oil, is recycled back into the 
flow of oily water for further processing. 

Two aspects of the hygro-responsive membranes' 
are particularly striking. First, the 
water flux through the steel-mesh membrane 
is exceptionally high at around 43,000 litres 
per square metre per hour (more than 10 litres 
per square metre per second). This is more 
than 1,000 times that of a typical industrial 
ultrafiltration membrane unit. Second, the fact 
that the authors’ technique for making hygro- 
responsive membranes can be applied to tex- 
tiles and other surfaces is exciting, because this 
will enable a range of options to be explored, 
parallelling the wide range of module types in 
the membrane industry. Filtration modules 
based on hygro-responsive membranes, and 
capable of treating many tonnes of oily water 
each day, may well emerge soon. 
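
The flux figures quoted above can be checked with a line or two of arithmetic. The Python sketch below simply converts the stated 43,000 litres per square metre per hour into a per-second value and works out the ceiling that the "more than 1,000 times" comparison implies for a typical ultrafiltration unit. The function and variable names are illustrative, and the implied ceiling is an inference from the text rather than a figure reported by the authors.

```python
SECONDS_PER_HOUR = 3600.0


def per_hour_to_per_second(flux_l_m2_h):
    """Convert a flux in L per m^2 per hour to L per m^2 per second."""
    return flux_l_m2_h / SECONDS_PER_HOUR


steel_mesh_flux = 43_000.0  # L m^-2 h^-1, as stated in the text
flux_per_second = per_hour_to_per_second(steel_mesh_flux)

# "More than 1,000 times" a typical unit implies the typical flux is
# below this value (an inference from the text, not a measured number).
implied_typical_ceiling = steel_mesh_flux / 1000.0

print(f"{flux_per_second:.1f} L per m^2 per second")
print(f"typical ultrafiltration flux below {implied_typical_ceiling:.0f} L per m^2 per hour")
```

The per-second value comes out just under 12, which agrees with the "more than 10 litres per square metre per second" given in the text.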

The authors describe their separation tech- 
nique as a capillary-force-based separation 
method — that is, one that exploits the differ- 
ence in capillary forces acting on the individ- 
ual phases of oily water as it interacts with the 
membrane. This is a fair description of the pro- 
cess. More questionable is their statement’ that 
their process is “solely gravity driven”. Although 
gravity can certainly be used to bring oily water 
emulsions into contact with the membrane, if 
an emulsion was pumped between two hygro- 
responsive membranes, I am confident that 
water would penetrate through both mem- 
branes irrespective of their orientation (and 
therefore of the influence of gravity). A simple 
experiment could be performed to test this. 

Tuteja and colleagues also provide an 
equation for the breakthrough pressure of their 
membranes — the maximum pressure differ- 
ence across the membrane at which the mater- 
ial prevents the permeation of oil. This enables 
pumped systems to be designed that use the 
membranes to separate oil-water emulsions. 
Such systems would be low-pressure systems in 
the eyes of process engineers, and would there- 
fore have low operating costs. A design for one 
possible system is shown in Figure 1. 

The authors separated emulsions of water 
and rapeseed oil as proof of concept of their 
work. In a related study’, others have sepa- 
rated mixtures of water and hexadecane (a 
diesel-like hydrocarbon). However, in the real 
world, filtration processes suffer from fouling 
and biofouling of the membranes. Further 
work using sea water and oil, and a systematic 
study of possible foulants, should therefore be 
undertaken to assess the commercial potential 
of these exciting new membranes. 
Outflows from 
the first quasars 


Black holes are best known for pulling matter in. But a distant supermassive 
black hole, observed as it was when the Universe was less than a billion years old, 
has been seen pushing gas out of its host galaxy. 


DANIEL MORTLOCK 


Astronomers have long known of 
elliptical galaxies, which contain mostly 
old stars and are largely devoid of interstellar 
gas. But the finding’” in 2004 that such 
objects existed about 11 billion years ago, when 
the Universe was only 3 billion years old, was 
surprising — it hadn't generally been thought 
that such galaxies could have formed so early. 
The most popular explanation’ was that these 
ancient ellipticals once hosted the earliest 
quasars (accreting supermassive black holes), 
and that the energy released during this quasar 
phase was sufficient to blow out the galaxy’s 
gas. Maiolino and colleagues* now provide a 
significant boost for the quasar-outflow model 
in a paper published in Monthly Notices of the 
Royal Astronomical Society. The authors made 
the remarkable discovery that one such distant 
quasar, known as SDSS J1148+5251, which 
is seen as it was when the Universe was less 
than 1 billion years old, has just the sort of gas 
outflow required by these models. 

The key to this story is the extreme envi- 
ronment at the centre of a galaxy. Most large 
galaxies, including the Milky Way, harbour 
at their centres black holes that have roughly 
a million times the Sun’s mass, but in some 
cases the central black hole can be more than a 
billion times heavier than the Sun. These black 
holes are believed to have grown by accreting 
surrounding gas, a gradual process in which 
the infalling material is compressed into a disk 
and heated to such high temperatures that it 
comfortably outshines all the stars in the host 
galaxy. It is these accreting supermassive black 
holes that are known as quasars. 

Figure 1 | Ejection of gas from a galaxy hosting a quasar. Maiolino et al.* have found a supermassive 
black hole (black circle) ejecting gas from its host galaxy. The white arrows show the spiral paths of 
material being accreted into the black hole, and the orange wavy lines represent photons emitted during 
this accretion process. Most of the photons escape the galaxy, perhaps to be seen by astronomers, but 
some impinge on clouds of gas (blue) in the galaxy, and this radiation pressure drives the gas out of the 
galaxy. The stars (yellow) and dark matter (grey points) are unaffected by the radiation. 

Although quasars are generally seen only as 
unresolved points of light, they have distinctive 
spectra characterized by broad ultraviolet and 
optical emission lines, which distinguish them 
from other astronomical sources. These emis- 
sion lines are broadened by the Doppler effect 
that is associated with motion in the environ- 
ment close to the quasar, revealing the extreme 
dynamics in the vicinity of the black hole. But 
the lines reveal little about the motion of the 
bulk of the interstellar gas farther out in the 
quasar’s host galaxy. 

The best way around this problem has been 
to try to measure emission lines associated with 
molecules that are not present in the immedi- 
ate surroundings of the black hole. One possi- 
bility is to make observations at submillimetre 
wavelengths, at which there are several 
ionized-carbon emission lines. This method 
has been used to identify outflows from 
relatively nearby quasars (see, for example, 
ref. 5). Maiolino et al. adopted this approach, 
using one of the world’s most sensitive milli- 
metre arrays, the Institut de Radioastronomie 
Millimétrique Plateau de Bure Interfero- 
meter, to measure the shape — and thus the 
velocity profile — of an ionized-carbon emis- 
sion line in the spectrum of SDSS J1148+5251. 
This light was emitted with a wavelength of 
0.158 millimetres but was redshifted by the 
expansion of the Universe so that it reached 
Earth with a wavelength of 1.17 millimetres. 
The data showed not only a core line with a 
velocity width of a few hundred kilometres 
per second, as expected of material moving in 
a large galaxy, but also much broader ‘wings’ 
indicative of gas flowing out at speeds of up to 
2,000 kilometres per second. 
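
The two wavelengths quoted above fix the quasar's redshift, and the quoted outflow speed sets how far the line wings should extend. The Python sketch below works through both numbers. It uses the standard relation z = observed/emitted - 1 and, as a simplifying assumption, the non-relativistic Doppler formula for the wing offset; the function names are illustrative and this is not the authors' analysis code.

```python
C_KM_S = 299_792.458  # speed of light in km/s


def redshift(emitted_mm, observed_mm):
    """Cosmological redshift from emitted and observed wavelengths."""
    return observed_mm / emitted_mm - 1.0


def doppler_offset_mm(observed_mm, velocity_km_s):
    """Non-relativistic wavelength offset for a line-of-sight velocity."""
    return observed_mm * velocity_km_s / C_KM_S


z = redshift(0.158, 1.17)
wing = doppler_offset_mm(1.17, 2000.0)

print(f"redshift z = {z:.2f}")
print(f"wing offset = {wing * 1000:.1f} micrometres either side of the line centre")
```

At 2,000 kilometres per second the offset is under 1% of the observed wavelength, which is why a sensitive interferometer is needed to trace the wings.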

By adopting simple models to describe the 
geometry of the outflow (which the obser- 
vations could not reveal), the authors found 
that the host galaxy of SDSS J1148+5251 
was losing 10 solar masses of gas every day. 
Given that the total molecular-gas content 
of the galaxy had previously been estimated 
at 20 billion solar masses’, the galaxy would 
have had all of its gas blown out in about 6 mil- 
lion years — a mere instant in cosmological 
terms. And although the kinetic power of the 
outflow, some 2 x 10** watts, might seem huge, 
it is less than 1% of the total power output of 
the quasar. 
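
The depletion time follows directly from the two figures in this paragraph. The Python sketch below is a back-of-envelope check that assumes a constant outflow rate and no replenishment of gas, both of which are simplifications; the names are illustrative.

```python
DAYS_PER_YEAR = 365.25


def depletion_time_years(gas_mass_msun, loss_msun_per_day):
    """Years needed to remove all the gas at a constant loss rate."""
    return gas_mass_msun / (loss_msun_per_day * DAYS_PER_YEAR)


gas_mass = 20e9    # solar masses, as quoted in the text
loss_rate = 10.0   # solar masses per day, as quoted in the text

t_years = depletion_time_years(gas_mass, loss_rate)
print(f"gas exhausted in about {t_years / 1e6:.1f} million years")
```

The result is roughly five and a half million years, in line with the "about 6 million years" given above once the rounding of the inputs is taken into account.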

Overall, Maiolino and colleagues’ data and 
interpretation paint a coherent picture of gas 
ejection from quasar host galaxies. However, 
given that quasars are fuelled by infalling 
material, it might seem surprising that they 
can also cause outflows. The explanation is 
that the light emitted by the quasar exerts a 
force (termed radiation pressure) on the sur- 
rounding gas, and in the extreme situation 
around a quasar this is strong enough to drive 
out all of the gas from the galaxy. The stars in 
the galaxy are so much denser than the gas 
that they are not noticeably affected, and the 
non-interacting dark matter between the stars 
does not experience any radiation pressure 
at all (Fig. 1). 

Maiolino et al. also found some evidence 
that the outflow is visibly extended in their 
images, which would imply that it spans 
much of the galaxy. However, the tentative 
nature of this measurement, and the implication 
that this would be the largest such 
outflow ever measured, make this result 
speculative at best — a point that the authors 
are careful to make themselves. By contrast, 
the main finding that quasar SDSS J1148+5251 
has been captured in the process of removing 
gas from its host galaxy seems quite 
robust, both because of the remarkable data 
and because of the existence of a compelling 
theoretical model. 


engagement RING 


The mechanistic details of the attachment of a small protein, ubiquitin, to 
other proteins are unclear. Crystal structures of the complexes formed by 
the E2-ubiquitin and RING E3 enzymes offer new insights. SEE ARTICLE P.115 


CHRISTOPHER D. LIMA 
& BRENDA A. SCHULMAN 


Cells use molecular tags to modulate the 
fates and functions of proteins. One 
such tag is ubiquitin, a small protein 
that regulates nearly every facet of cellular 
function in eukaryotes (organisms such as 
animals, plants and fungi). Tagging a protein 
with ubiquitin requires the sequential action of 
three types of enzyme: E1 activating enzymes 
attach ubiquitin to a cysteine amino-acid residue 
on E2 conjugating enzymes, and E3 ligases 
stimulate ubiquitin transfer from E2-ubiquitin 
onto a lysine residue of the substrate protein. 
How E3 enzymes — the most common 
of which belong to the RING family¹ — carry 
out the final step has been a long-standing 
mystery. Now Plechanovova et al.² (page 115 
of this issue) and Dou et al.³ (writing in Nature 
Structural & Molecular Biology) illuminate this 
mechanism at high resolution, by describing 
the structures of RING E3 ligases engaged with 
E2-ubiquitin. Their results suggest a mode of 
action that could apply to other E3 enzymes. 
More than 600 human genes encode RING 
or RING-like E3 ligases, underscoring their 
biological importance’. Canonical RING 
proteins contain a zinc-binding region that 
is rich in cysteine and histidine residues and 
that, on its own, can bind to E2-ubiquitin and 
promote ubiquitin transfer’. Previous crystal 
structures revealed some of the interactions 
between E2 and E3 enzymes, but none of them 
had captured the elusive association of an 
E2-ubiquitin intermediate and a RING E3 
ligase. This was largely because the link between 
E2 and ubiquitin is a labile thioester bond. 

Plechanovova et al. and Dou et al. cleverly 
overcame this challenge by using engineered 
E2 proteins that were linked to ubiquitin 
through more-stable bond types (peptide 
and oxyester bonds, respectively). Both 
groups of researchers mixed their engineered 
E2-ubiquitin with an E3 RING ligase, and 
determined the crystal structures of the result- 
ing RING-E2-ubiquitin protein complexes. 
For the E3 ligase, Dou et al. used a dimeric 
BIRC7, whereas Plechanovova et al. used a 
tandem protein fusion (RNF4-RNF4) to 
mimic the RNF4 dimer. 

Earlier studies showed that, in the absence 
of an E3 partner, E2-ubiquitin can adopt many 
inactive (‘open’) configurations’, which presumably 
prevent the transfer of the molecular 
ubiquitin tag to a substrate protein (Fig. 1a). 
The structures determined by Plechanovova 
et al. and Dou et al. reveal that RING E3 ligases 
lock E2-ubiquitin into an activated, closed 
conformation that is poised for ubiquitin 
transfer; such a form has also been described 
in concurrent studies of similar proteins using 
nuclear magnetic resonance”. 

In the RING-E2-ubiquitin crystal struc- 
tures, certain amino-acid residues of one of 
Figure 1 | A unified model for ubiquitin transfer. The small protein 
ubiquitin is attached to a cysteine residue on E2 conjugating enzymes as an 
intermediate step before being transferred to other proteins through a process 
that is stimulated by E3 ligase enzymes. a, The transfer reaction is presumably 
hindered by a ‘wobbling’ movement of ubiquitin when attached to an isolated 
E2 protein. Plechanovova et al.” and Dou et al.’ report crystal structures of 
E2-ubiquitin bound to dimeric E3 ligases of the RING family. They show 
that RING E3 ligases guide E2-ubiquitin into an active conformation by 


the two RING monomers interact with both 
ubiquitin and the E2 protein. Of note, an argi- 
nine side chain of one RING monomer bridges 
the E2 protein and the carboxy-terminal tail 
of ubiquitin. The opposite RING subunit 
also contacts ubiquitin through, for example, 
a highly evolutionarily conserved tyrosine 
or phenylalanine residue. Moreover, a zinc- 
bound histidine (which is characteristically 
found in canonical RING proteins) interacts 
with ubiquitin through a hydrogen bond. 

The crystal structures also show an exten- 
sive network of interactions between the E2 
protein and its linked ubiquitin. In particular, 
Plechanovova et al. describe a hydrogen bond 
between a carbonyl oxygen of ubiquitin’s 
C-terminal tail and a highly conserved aspara- 
gine side chain of the E2 protein; this aspara- 
gine is known’ to be required for efficient 
ubiquitin transfer. In addition, an aspartate 
residue of the E2 protein, which had previously 
been shown to have a role in activating the sub- 
strate protein's lysine’, is reconfigured in the 
RING-E2-ubiquitin complexes. 

The findings support a model by which 
RING binding reduces the conformational 
heterogeneity of E2-ubiquitin and constrains 
ubiquitin’s C-terminal tail in a shallow cleft 
within the E2 protein (Fig. 1a). As a result, the 
thioester bond becomes suitably positioned for 
attack by the substrate protein’s lysine, and sev- 
eral residues of the E2 protein are rearranged 
to promote the transfer reaction. Both groups 
of authors validated the model through care- 
ful biochemical studies. For example, ubiqui- 
tin transfer was diminished when the authors 
made amino-acid changes in the E3 ligase that 
were predicted to impair its interactions with 
ubiquitin or with the E2 protein®. Moreover, 
Plechanovova and colleagues describe that 
their E2-ubiquitin is a competitive inhibi- 
tor of E3-mediated ubiquitin transfer to 
substrate proteins. This result confirms that 
E2-ubiquitin (in which the two proteins are 
linked through a peptide bond instead of a 
thioester) is structurally similar to natural 
E2-ubiquitin. 

Does the model hold for other E2 and E3 
proteins? An earlier study’ showed that a non- 
RING E3 ligase (RanBP2) interacts with E2- 
SUMO in such a way that both E2 and SUMO 
(a ubiquitin-like protein) are optimally posi- 
tioned for the transfer reaction to take place. 
And the RING-E2-ubiquitin structures show 
striking similarities to that of the protein com- 
plex formed by RanBP2, an E2 protein and a 
SUMO-tagged protein substrate” (Fig. 1b). 
Furthermore, Plechanovova et al. show that 
CHIP, an E3 ligase belonging to the RING-like 
U-box family, also stimulates ubiquitin trans- 
fer by rearranging E2-ubiquitin into a closed 
configuration. Moreover, computer model- 
ling'*" and nuclear magnetic resonance data” 
have indicated that some monomeric RING, or 
RING-related (SP-RING), E3 ligases contain 
elements that could lock E2-ubiquitin or E2- 
SUMO into a closed conformation. 

However, there is evidence that, for some E2 
proteins, E2-ubiquitin can adopt a closed con- 
figuration in the absence of E3 ligases’*“*. And 
it is unclear whether some other types of E3 
ligase, which transfer ubiquitin through mech- 
anisms different from those used by RING 
proteins, will follow the model described by 
the authors. For example, for E3 ligases of the 
HECT and RBR families, ubiquitin is trans- 
ferred from an E2 protein onto a cysteine in the 
E3 enzyme, before being attached to the pro- 
tein substrate. Although details of the second 
step await elucidation, it has been reported’ 
that HECT binding to E2-ubiquitin promotes 
tag transfer without stimulating E2-ubiquitin 
thioester reactivity, in contrast to RING, SP- 
RING and some other E3 ligases. 

In summary, a unified model emerges for 
those E3 ligases that activate the reactivity 
establishing specific interactions with both ubiquitin and the E2 protein. In 
particular, an arginine and a tyrosine (or phenylalanine) of the E3 enzyme are 
crucial for securing ubiquitin into a position that activates transfer. As a result 
of these interactions, several residues in the E2 protein (such as an asparagine 
and an aspartate) are reorganized to facilitate the transfer reaction. b, Variations 
on this mechanism are used by monomeric RING, RING-like and some non- 
RING E3 ligases”"’ to activate transfer of ubiquitin (or ubiquitin-like proteins 
such as SUMO) from E2 proteins to a lysine residue on protein substrates. 


of the thioester bond. The binding of an E3 
enzyme restricts the conformations avail- 
able for E2-ubiquitin, which is then forced to 
adopt a configuration that optimally aligns the 
thioester for attack by the substrate’s lysine. 
Future studies are required, however, to 
address how E3-E2-ubiquitin complexes 
interact with their protein substrates. 


Christopher D. Lima is in the Structural 
Biology Program, Sloan-Kettering Institute, 
New York, New York 10065, USA. 

Brenda A. Schulman is at the Howard 
Hughes Medical Institute, Department of 
Structural Biology, St. Jude Children’s Research 
Hospital, Memphis, Tennessee 38105, USA. 
e-mails: limac@mskcc.org; 
brenda.schulman@stjude.org 


1. Deshaies, R. J. & Joazeiro, C. A. Annu. Rev. Biochem. 78, 399-434 (2009). 
2. Plechanovova, A., Jaffray, E. G., Tatham, M. H., Naismith, J. H. & Hay, R. T. Nature 489, 115-120 (2012). 
3. Dou, H., Buetow, L., Sibbet, G. J., Cameron, K. & Huang, D. T. Nature Struct. Mol. Biol. http://dx.doi.org/10.1038/nsmb.2379 (2012). 
4. Pruneda, J. N., Stoll, K. E., Bolton, L. J., Brzovic, P. S. & Klevit, R. E. Biochemistry 50, 1624-1633 (2011). 
5. Pruneda, J. N. et al. Mol. Cell http://dx.doi.org/10.1016/j.molcel.2012.07.001 (2012). 
6. Wu, P. Y. et al. EMBO J. 22, 5241-5250 (2003). 
7. Yunus, A. A. & Lima, C. D. Nature Struct. Mol. Biol. 13, 491-499 (2006). 
8. Plechanovova, A. et al. Nature Struct. Mol. Biol. 18, 1052-1059 (2011). 
9. Reverter, D. & Lima, C. D. Nature 435, 687-692 (2005). 
10. Yunus, A. A. & Lima, C. D. Mol. Cell 35, 669-682 (2009). 
11. Dou, H. et al. Nature Struct. Mol. Biol. 19, 184-192 (2012). 
12. Hamilton, K. S. et al. Structure 9, 897-904 (2001). 
13. Wickliffe, K. E., Lorenz, S., Wemmer, D. E., Kuriyan, J. & Rape, M. Cell 144, 769-781 (2011). 
14. Saha, A., Lewis, S., Kleiger, G., Kuhlman, B. & Deshaies, R. J. Mol. Cell 42, 75-83 (2011). 
15. Kamadurai, H. B. et al. Mol. Cell 36, 1095-1102 (2009). 


COVER ILLUSTRATION (PREVIOUS PAGE): CARL DETORRES 


2001 WILL ALWAYS BE REMEMBERED
AS THE YEAR OF THE HUMAN GENOME.

The availability of its sequence transformed biology, and the exemplary way in which hundreds
of researchers came together to form a public consortium paved the way for 
‘big science’ in biology. It was an incredible achievement but it was always 

clear that knowing the ‘code’ was only the beginning. To understand how cells 
interpret the information locked within the genome much more needed to be 
learnt. This became the task of ENCODE, the Encyclopedia Of DNA Elements, 
the aim of which was to describe all functional elements encoded in the human 
genome. Nine years after launch, its main efforts culminate in the publication of 
30 coordinated papers, 6 of which are in this issue of Nature. 

Collectively, the papers describe 1,640 data sets generated across 147 
different cell types. Among the many important results there is one that stands 
out above them all: more than 80% of the human genome’s components have 
now been assigned at least one biochemical function. 

The implications of the ENCODE findings extend to many fields in biology. In 
a News & Views Forum on page 52, scientists representing five different areas of 
research share their views on what the results mean to them and their work. On 
page 49, Ewan Birney, the leader and coordinator of the ENCODE consortium, 
discusses the challenges of doing consortium-driven science; related issues are 
explored in a Careers feature on page 165. 

Dizzying amounts of data have been produced by the ENCODE project and 
are openly accessible; countless more analyses are therefore to be expected, in 
addition to the multitude now being published. Finding a balance between data 
collection and analysis is the topic of a News Feature on page 46. 

The papers, which are freely available to all, and the articles in this issue are 
complemented by an extensive range of online features (nature.com/encode). 
Among them are interactive figures in the overview ENCODE paper, which also 
features a virtual machine to allow you to interact more closely with the data 
and their analyses. In line with the community spirit with which the work was 
undertaken, we also present online the related papers published in Genome 
Research and Genome Biology. To help you navigate through the data we have 
created the Nature ENCODE Explorer and we introduce ‘threads’, which allow you 
to explore biological themes between the papers. We hope you enjoy the package. 
Philip Campbell Editor-in-Chief
NEWS & VIEWS


FORUM: Genomics 


ENCODE explained 


The Encyclopedia of DNA Elements (ENCODE) project dishes up a hearty banquet of data that illuminate the roles of the 
functional elements of the human genome. Here, five scientists describe the project and discuss how the data are influencing 
research directions across many fields. SEE ARTICLES P.57, P.75, P.83, P.91, P.101 & LETTER P.109


Serving up a 
genome feast 
JOSEPH R. ECKER 


Starting with a list of simple ingredients
and blending them in the precise amounts 
needed to prepare a gourmet meal is a chal- 
lenging task. In many respects, this task is 
analogous to the goal of the ENCODE project,
the recent progress of which is described in
this issue. The project aims to fully describe
the list of common ingredients (functional 
elements) that make up the human genome 
(Fig. 1). When mixed in the right proportions, 
these ingredients constitute the information 
needed to build all the types of cells, body 
organs and, ultimately, an entire person from 
a single genome. 

The ENCODE pilot project focused on
just 1% of the genome — a mere appetizer — 
and its results hinted that the list of human 
genes was incomplete. Although there was 
scepticism about the feasibility of scaling up 
the project to the entire genome and to many 
hundreds of cell types, recent advances in low- 
cost, rapid DNA-sequencing technology radically
changed that view. Now the ENCODE
consortium presents a menu of 1,640 genome- 
wide data sets prepared from 147 cell types, 
providing a six-course serving of papers in 
Nature, along with many companion publica- 
tions in other journals. 

One of the more remarkable findings 
described in the consortium’s ‘entrée’ paper 
(page 57) is that 80% of the genome contains
elements linked to biochemical functions,
dispatching the widely held view that
the human genome is mostly ‘junk DNA’. The
authors report that the space between genes 
is filled with enhancers (regulatory DNA ele- 
ments), promoters (the sites at which DNA’s 
transcription into RNA is initiated) and 
numerous previously overlooked regions that 
encode RNA transcripts that are not trans- 
lated into proteins but might have regula- 
tory roles. Of note, these results show that 
many DNA variants previously correlated 


with certain diseases lie within or very near 
non-coding functional DNA elements, pro- 
viding new leads for linking genetic variation 
and disease. 

The five companion articles dish up
diverse sets of genome-wide data regarding the 
mapping of transcribed regions, DNA binding 
of regulatory proteins (transcription factors) 
and the structure and modifications of chro- 
matin (the association of DNA and proteins 
that makes up chromosomes), among other 
delicacies. 

Djebali and colleagues (page 101) describe
ultra-deep sequencing of RNAs prepared from 
many different cell lines and from specific 
compartments within the cells. They conclude 
that about 75% of the genome is transcribed 
at some point in some cells, and that genes 
are highly interlaced with overlapping tran- 
scripts that are synthesized from both DNA 
strands. These findings force a rethink of the 
definition of a gene and of the minimum unit 
of heredity. 

Moving on to the second and third 
courses, Thurman et al. and Neph et al.
(pages 75 and 83) have prepared two tasty 
chromatin-related treats. Both studies 
are based on the DNase I hypersensitivity 
assay, which detects genomic regions at 
which enzyme access to, and subsequent 
cleavage of, DNA is unobstructed by chro- 
matin proteins. The authors identified cell- 
specific patterns of DNase I hypersensitive 
sites that show remarkable concordance 
with experimentally determined and com- 
putationally predicted binding sites of 
transcription factors. Moreover, they have 
doubled the number of known recognition 
sequences for DNA-binding proteins in the 
human genome, and have revealed a 50-base-pair
‘footprint’ that is present in thousands of
promoters.

The next course, provided by Gerstein and
colleagues (page 91), examines the principles
behind the wiring of transcription-factor 


ENCODE 


Encyclopedia of DNA Elements 
nature.com/encode 
networks. In addition to assigning relatively
simple functions to genome elements (such 
as ‘protein X binds to DNA element Y’), this 
study attempts to clarify the hierarchies of 
transcription factors and how the intertwined 
networks arise. 

Beyond the linear organization of genes and 
transcripts on chromosomes lies a more com- 
plex (and still poorly understood) network of 
chromosome loops and twists through which 

promoters and more distal elements, such
as enhancers, can communicate their
regulatory information to each other. In
the final course of the ENCODE genome
feast, Sanyal and
colleagues (page 109) map more than 1,000
of these long-range signals in each cell type. 
Their findings begin to overturn the long-held 
(and probably oversimplified) prediction that 
the regulation of a gene is dominated by its 
proximity to the closest regulatory elements. 

One of the major future challenges for 
ENCODE (and similarly ambitious pro- 
jects) will be to capture the dynamic aspects 
of gene regulation. Most assays provide a 
single snapshot of cellular regulatory events, 
whereas a time series capturing how such 
processes change is preferable. Additionally, 
the examination of large batches of cells — as 
required for the current assays — may pre- 
sent too simplified a view of the underlying 
regulatory complexity, because individual 
cells in a batch (despite being genetically 
identical) can sometimes behave in different 
ways. The development of new technologies 
aimed at the simultaneous capture of mul- 
tiple data types, along with their regulatory 
dynamics in single cells, would help to tackle 
these issues. 

A further challenge is identifying how the 
genomic ingredients are combined to assemble 
the gene networks and biochemical pathways 
that carry out complex functions, such as cell- 
to-cell communication, which enable organs 
and tissues to develop. An even greater chal- 
lenge will be to use the rapidly growing body 


Figure 1 | Beyond the sequence. The ENCODE project provides
information on the human genome far beyond that contained within the DNA 
sequence — it describes the functional genomic elements that orchestrate the 
development and function of a human. The project contains data about the 
degree of DNA methylation and chemical modifications to histones that can 
influence the rate of transcription of DNA into RNA molecules (histones are 
the proteins around which DNA is wound to form chromatin). ENCODE also 
examines long-range chromatin interactions, such as looping, that alter the 
relative proximities of different chromosomal regions in three dimensions and 
also affect transcription. Furthermore, the project describes the binding activity 


of data from genome-sequencing projects to 
understand the range of human phenotypes 
(traits), from normal developmental processes, 
such as ageing, to disorders such as
Alzheimer’s disease.

Achieving these ambitious goals may 
require a parallel investment of functional 
studies using simpler organisms — for exam- 
ple, of the type that might be found scamp- 
ering around the floor, snatching up crumbs 
in the chefs’ kitchen. All in all, however, the 
ENCODE project has served up an all-you- 
can-eat feast of genomic data that we will be 
digesting for some time. Bon appétit! 


Joseph R. Ecker is at the Howard Hughes 
Medical Institute and the Salk Institute for 
Biological Studies, La Jolla, California 92037, 
USA. 

e-mail: ecker@salk.edu 


Expression 
control 


WENDY A. BICKMORE 


Once the human genome had been
sequenced, it became apparent that 
an encyclopaedic knowledge of chromatin 
organization would be needed if we were to 
understand how gene expression is regulated. 
The ENCODE project goes a long way to 
achieving this goal and highlights the pivotal 
role of transcription factors in sculpting the 
chromatin landscape. 

Although some of the analyses largely con- 
firm conclusions from previous smaller-scale 
studies, this treasure trove of genome-wide 
data provides fresh insight into regulatory 


of transcription-factor proteins and the architecture (location and sequence) of 
gene-regulatory DNA elements, which include the promoter region upstream of 
the point at which transcription of an RNA molecule begins, and more distant 
(long-range) regulatory elements. Another section of the project was devoted 

to testing the accessibility of the genome to the DNA-cleavage protein DNase I. 
These accessible regions, called DNase I hypersensitive sites, are thought to 
indicate specific sequences at which the binding of transcription factors and 
transcription-machinery proteins has caused nucleosome displacement. In 
addition, ENCODE catalogues the sequences and quantities of RNA transcripts, 
from both non-coding and protein-coding regions. 


pathways and identifies prodigious numbers 
of regulatory elements. This is particularly so 
for Thurman and colleagues’ data regarding
DNase I hypersensitive sites (DHSs) and for
Gerstein and colleagues’ results concerning
DNA binding of transcription factors. DHSs 
are genomic regions that are accessible to enzy- 
matic cleavage as a result of the displacement 
of nucleosomes (the basic units of chromatin) 
by DNA-binding proteins (Fig. 1). They are the 
hallmark of cell-type-specific enhancers, which 
are often located far away from promoters. 
The ENCODE papers expose the profusion 
of DHSs — more than 200,000 per cell type, far 
outstripping the number of promoters — and 
their variability between cell types. Through 
the simultaneous presence in the same cell 
type of a DHS and a nearby active promoter, 
the researchers paired half a million enhancers 
with their probable target genes. But this leaves 
11 Years Ago


The draft 
human genome 


OUR GENOME UNVEILED 

Unless the human genome contains 
a lot of genes that are opaque to our 
computers, it is clear that we do not 
gain our undoubted complexity 
over worms and plants by using 
many more genes. Understanding 
what does give us our complexity — 
our enormous behavioural 
repertoire, ability to produce 
conscious action, remarkable 
physical coordination (shared with 
other vertebrates), precisely tuned 
alterations in response to external 
variations of the environment, 
learning, memory ... need I go 

on? — remains a challenge for the 
future. 

David Baltimore 

From Nature 15 February 2001 


GENOME SPEAK 

With the draft in hand, researchers 
have a new tool for studying the 
regulatory regions and networks 
of genes. Comparisons with other 
genomes should reveal common 
regulatory elements, and the 
environments of genes shared with 
other species may offer insight into 
function and regulation beyond the 
level of individual genes. The draft 
is also a starting point for studies 
of the three-dimensional packing 
of the genome into a cell’s nucleus. 
Such packing is likely to influence 
gene regulation ... The human 
genome lies before us, ready for 
interpretation. 

Peer Bork and Richard Copley 
From Nature 15 February 2001 


more than 2 million putative enhancers with- 
out known targets, revealing the enormous 
expanse of the regulatory genome landscape 
that is yet to be explored. Chromosome-con- 
formation-capture methods that detect long- 
range physical associations between distant 
DNA regions are attempting to bridge this gap. 
Indeed, Sanyal and colleagues applied these
techniques to survey such associations across 
1% of the genome. 

The ENCODE data start to paint a picture 
of the logic and architecture of transcriptional 
networks, in which DNA binding of a few 
high-affinity transcription factors displaces 
nucleosomes and creates a DHS, which in turn 
facilitates the binding of further, lower-affinity 
factors. The results also support the idea that 
transcription-factor binding can block DNA 
methylation (a chemical modification of DNA 
that affects gene expression), rather than the 
other way around — which is highly relevant 
to the interpretation of disease-associated sites 
of altered DNA methylation.

The exquisite cell-type specificity of regula- 
tory elements revealed by the ENCODE studies 
emphasizes the importance of having appropri- 
ate biological material on which to test hypothe- 
ses. The researchers have focused their efforts on 
a set of well-established cell lines, with selected 
assays extended to some freshly isolated cells. 
Challenges for the future include following the 
dynamic changes in the regulatory landscape 
during specific developmental pathways, and 
understanding chromatin structure in tissues 
containing heterogeneous cell populations. 


Wendy A. Bickmore is in the Medical 
Research Council Human Genetics Unit, 
MRC Institute of Genetics and Molecular 
Medicine, University of Edinburgh, 
Edinburgh EH4 2XU, UK. 

e-mail: wendy.bickmore@igmm.ed.ac.uk


Non-coding 
but functional 
INES BARROSO 


The vast majority of the human genome
does not code for proteins and, until 
now, did not seem to contain defined gene- 
regulatory elements. Why evolution would 
maintain large amounts of ‘useless’ DNA had 
remained a mystery, and seemed wasteful. It 
turns out, however, that there are good reasons 
to keep this DNA. Results from the ENCODE 
project show that most of these stretches of
DNA harbour regions that bind proteins and 
RNA molecules, bringing these into positions 
from which they cooperate with each other to 
regulate the function and level of expression of 
protein-coding genes. In addition, it seems that 
widespread transcription from non-coding 
DNA potentially acts as a reservoir for the
creation of new functional molecules, such as 
regulatory RNAs. 

What are the implications of these results 
for genetic studies of complex human traits 
and disease? Genome-wide association stud- 
ies (GWAS), which link variations in DNA 
sequence with specific traits and diseases, have 
in recent years become the workhorse of the 
field, and have identified thousands of DNA 
variants associated with hundreds of complex 
traits (such as height) and diseases (such as
diabetes). But association is not causality,
and identifying those variants that are
causally linked to a given disease or trait,
and understanding how they exert such
influence, has been difficult. Furthermore,
most of these associated variants lie in
non-coding regions, so their functional effects 
have remained undefined. 

The ENCODE project provides a detailed 
map of additional functional non-coding 
units in the human genome, including some 
that have cell-type-specific activity. In fact, 
the catalogue contains many more func- 
tional non-coding regions than genes. These 
data show that results of GWAS are typically 
enriched for variants that lie within such 
non-coding functional units, sometimes in 
a cell-type-specific manner that is consist- 
ent with certain traits, suggesting that many 
of these regions could be causally linked to 
disease. Thus, the project demonstrates that 
non-coding regions must be considered when 
interpreting GWAS results, and it provides a 
strong motivation for reinterpreting previous 
GWAS findings. Furthermore, these results 
imply that sequencing studies focusing on 
protein-coding sequences (the ‘exome’) risk 
missing crucial parts of the genome and the 
ability to identify true causal variants. 

However, although the ENCODE cata- 
logues represent a remarkable tour de force, 
they contain only an initial exploration of the 
depths of our genome, because many more cell 
types must yet be investigated. Some of the 
remaining challenges for scientists searching 
for causal disease variants lie in: accessing data 
derived from cell types and tissues relevant to 
the disease under study; understanding how 
these functional units affect genes that may be 
distantly located; and the ability to generalize
such results to the entire organism. 


Inés Barroso is at the Wellcome Trust Sanger 
Institute, Hinxton CB10 1SA, UK, and at 

the University of Cambridge Metabolic 
Research Laboratories and NIHR Cambridge 
Biomedical Research Centre, Cambridge, UK. 
e-mail: ib1@sanger.ac.uk
An integrated encyclopedia of DNA
elements in the human genome 


The ENCODE Project Consortium* 


The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is 
unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, 
transcription factor association, chromatin structure and histone modification. These data enabled us to assign 
biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many
discovered candidate regulatory elements are physically associated with one another and with expressed genes, 
providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical 
correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. 
Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an 
expansive resource of functional annotations for biomedical research. 


The human genome sequence provides the underlying code for human biology.
Despite intensive study, especially in identifying protein-coding genes, our
understanding of the genome is far from complete, particularly with
regard to non-coding RNAs, alternatively spliced transcripts and regulatory
sequences. Systematic analyses of transcripts and regulatory
information are essential for the identification of genes and regulatory
regions, and are an important resource for the study of human biology
and disease. Such analyses can also provide comprehensive views of the
organization and variability of genes and regulatory information across
cellular contexts, species and individuals.

The Encyclopedia of DNA Elements (ENCODE) project aims to
delineate all functional elements encoded in the human genome.
Operationally, we define a functional element as a discrete genome
segment that encodes a defined product (for example, protein or
non-coding RNA) or displays a reproducible biochemical signature
(for example, protein binding, or a specific chromatin structure).
Comparative genomic studies suggest that 3-8% of bases are under
purifying (negative) selection and therefore may be functional,
although other analyses have suggested much higher estimates.
In a pilot phase covering 1% of the genome, the ENCODE project
annotated 60% of mammalian evolutionarily constrained bases, but
also identified many additional putative functional elements without
evidence of constraint. The advent of more powerful DNA sequencing
technologies now enables whole-genome and more precise analyses
with a broad repertoire of functional assays.

Here we describe the production and initial analysis of 1,640 data
sets designed to annotate functional elements in the entire human
genome. We integrate results from diverse experiments within cell types,
related experiments involving 147 different cell types, and all ENCODE
data with other resources, such as candidate regions from genome-wide
association studies (GWAS) and evolutionarily constrained regions.
Together, these efforts reveal important features about the organization
and function of the human genome, summarized below.

• The vast majority (80.4%) of the human genome participates in at
least one biochemical RNA- and/or chromatin-associated event in at
least one cell type. Much of the genome lies close to a regulatory event:
95% of the genome lies within 8 kilobases (kb) of a DNA-protein
interaction (as assayed by bound ChIP-seq motifs or DNase I footprints),
and 99% is within 1.7 kb of at least one of the biochemical events
measured by ENCODE.

• Primate-specific elements as well as elements without detectable
mammalian constraint show, in aggregate, evidence of negative selection;
thus, some of them are expected to be functional.

• Classifying the genome into seven chromatin states indicates an initial
set of 399,124 regions with enhancer-like features and 70,292 regions
with promoter-like features, as well as hundreds of thousands of quiescent
regions. High-resolution analyses further subdivide the genome
into thousands of narrow states with distinct functional properties.

• It is possible to correlate quantitatively RNA sequence production
and processing with both chromatin marks and transcription factor
binding at promoters, indicating that promoter functionality can
explain most of the variation in RNA expression.

• Many non-coding variants in individual genome sequences lie in
ENCODE-annotated functional regions; this number is at least as
large as those that lie in protein-coding genes.

• Single nucleotide polymorphisms (SNPs) associated with disease by
GWAS are enriched within non-coding functional elements, with a
majority residing in or near ENCODE-defined regions that are outside
of protein-coding genes. In many cases, the disease phenotypes
can be associated with a specific cell type or transcription factor.
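
A coverage figure such as the 80.4% quoted above comes down to taking the union of annotated intervals and dividing by genome length. The Python sketch below illustrates that calculation only. The function names, the half-open coordinate convention and the toy intervals are assumptions made for illustration and are not taken from the ENCODE pipelines.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of counting bases twice.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(intervals, genome_length):
    """Fraction of a genome of genome_length bases covered by any interval."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical toy annotation on a 100-base 'genome'.
toy_elements = [(0, 30), (20, 50), (70, 80)]
toy_fraction = covered_fraction(toy_elements, 100)
print(toy_fraction)
```

Merging before summing matters because elements from different assays overlap heavily; summing raw lengths would overstate the covered fraction.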


ENCODE data production and initial analyses 

Since 2007, ENCODE has developed methods and performed a large 
number of sequence-based studies to map functional elements across 
the human genome. The elements mapped (and approaches used)
include RNA transcribed regions (RNA-seq, CAGE, RNA-PET and 
manual annotation), protein-coding regions (mass spectrometry), 
transcription-factor-binding sites (ChIP-seq and DNase-seq), 
chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq and 
MNase-seq), and DNA methylation sites (RRBS assay) (Box 1 lists 
methods and abbreviations; Supplementary Table 1, section P, details 
production statistics). To compare and integrate results across the
different laboratories, data production efforts focused on two selected 


*Lists of participants and their affiliations appear at the end of the paper.


BOX 1
ENCODE abbreviations 


RNA-seq. Isolation of RNA sequences, often with different purification 
techniques to isolate different fractions of RNA followed by high- 
throughput sequencing. 

CAGE. Capture of the methylated cap at the 5’ end of RNA, followed by 
high-throughput sequencing of a small tag adjacent to the 

5’ methylated caps. 5’ methylated caps are formed at the initiation of 
transcription, although other mechanisms also methylate 5’ ends of 
RNA. 
RNA-PET. Simultaneous capture of RNAs with both a 5’ methyl cap 
and a poly(A) tail, which is indicative of a full-length RNA. This is then 
followed by sequencing a short tag from each end by high-throughput 
sequencing. 
ChIP-seq. Chromatin immunoprecipitation followed by sequencing. 
Specific regions of crosslinked chromatin, which is genomic DNA in 
complex with its bound proteins, are selected by using an antibody to a
specific epitope. The enriched sample is then subjected to high- 
throughput sequencing to determine the regions in the genome most 
often bound by the protein to which the antibody was directed. Most 
often used are antibodies to any chromatin-associated epitope, 
including transcription factors, chromatin binding proteins and 
specific chemical modifications on histone proteins. 

DNase-seq. Adaption of established regulatory sequence assay to 
modern techniques. The DNase I enzyme will preferentially cut live
chromatin preparations at sites where nearby there are specific (non-histone)
proteins. The resulting cut points are then sequenced using
high-throughput sequencing to determine those sites ‘hypersensitive’
to DNase I, corresponding to open chromatin.
FAIRE-seq. Formaldehyde assisted isolation of regulatory elements. 
FAIRE isolates nucleosome-depleted genomic regions by exploiting 
the difference in crosslinking efficiency between nucleosomes (high) 
and sequence-specific regulatory factors (low). FAIRE consists of 
crosslinking, phenol extraction, and sequencing the DNA fragments in 
the aqueous phase. 

RRBS. Reduced representation bisulphite sequencing. Bisulphite 
treatment of DNA sequence converts unmethylated cytosines to 
uracil. To focus the assay and save costs, specific restriction enzymes 
that cut around CpG dinucleotides can reduce the genome to a portion
specifically enriched in CpGs. This enriched sample is then sequenced 
to determine the methylation status of individual cytosines 
quantitatively. 

Tier 1. Tier 1 cell types were the highest-priority set and comprised 
three widely studied cell lines: K562 erythroleukaemia cells; 
GM12878, a B-lymphoblastoid cell line that is also part of the 1000 
Genomes project (http://1000genomes.org); and the H1 embryonic
stem cell (H1 hESC) line. 

Tier 2. The second-priority set of cell types in the ENCODE project 
which included HeLa-S3 cervical carcinoma cells, HepG2 
hepatoblastoma cells and primary (non-transformed) human 
umbilical vein endothelial cells (HUVECs). 

Tier 3. Any other ENCODE cell types not in tier 1 or tier 2. 


sets of cell lines, designated ‘tier 1’ and ‘tier 2’ (Box 1). To capture a 
broader spectrum of biological diversity, selected assays were also 
executed on a third tier comprising more than 100 cell types including 
primary cells. All data and protocol descriptions are available at 
http://www.encodeproject.org/, and a User’s Guide including details 
of cell-type choice and limitations was published recently.


Integration methodology 

For consistency, data were generated and processed using standardized 
guidelines, and for some assays, new quality-control measures were 
designed (see refs 3, 12 and http://encodeproject.org/ENCODE/dataStandards.html;
A. Kundaje, personal communication). Uniform
data-processing methods were developed for each assay (see 
Supplementary Information; A. Kundaje, personal communication), 
and most assay results can be represented both as signal information 
(a per-base estimate across the genome) and as discrete elements 
(regions computationally identified as enriched for signal). Extensive 
processing pipelines were developed to generate each representation 
(M. M. Hoffman et al., manuscript in preparation and A. Kundaje, 
personal communication). In addition, we developed the irreproducible 
discovery rate (IDR) measure to provide a robust and conservative
estimate of the threshold where two ranked lists of results from bio- 
logical replicates no longer agree (that is, are irreproducible), and we 
applied this to defining sets of discrete elements. We identified, and 
excluded from most analyses, regions yielding untrustworthy signals 
likely to be artefactual (for example, multicopy regions). Together, these 
regions comprise 0.39% of the genome (see Supplementary 
Information). The poster accompanying this issue represents different 
ENCODE-identified elements and their genome coverage.
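
The notion of a rank beyond which two replicates stop agreeing can be illustrated with a much simpler stand-in than the IDR itself. The Python sketch below is not the IDR measure, which is model-based. It only computes the overlap between the top-k entries of two ranked lists and reports the deepest rank at which that overlap still meets a chosen cut-off. All function names, the cut-off of 0.9 and the toy peak labels are hypothetical.

```python
def top_k_agreement(rep1_ranked, rep2_ranked, k):
    """Fraction of the top-k items of replicate 1 also found in the top-k of replicate 2."""
    return len(set(rep1_ranked[:k]) & set(rep2_ranked[:k])) / k


def last_consistent_rank(rep1_ranked, rep2_ranked, min_agreement=0.9):
    """Largest k whose top-k agreement is at least min_agreement (0 if none)."""
    n = min(len(rep1_ranked), len(rep2_ranked))
    best = 0
    for k in range(1, n + 1):
        if top_k_agreement(rep1_ranked, rep2_ranked, k) >= min_agreement:
            best = k
    return best


# Toy replicates: four shared strong peaks, then replicate-specific noise.
rep1 = ["a", "b", "c", "d", "x1", "x2"]
rep2 = ["b", "a", "c", "d", "y1", "y2"]
print(last_consistent_rank(rep1, rep2))  # 4: agreement falls to 0.8 at k = 5
```

In this toy case the first four ranks are kept as the discrete element set, which mirrors the intent of the threshold described above without reproducing its statistics.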


Transcribed and protein-coding regions 

We used manual and automated annotation to produce a compre- 
hensive catalogue of human protein-coding and non-coding RNAs as 
well as pseudogenes, referred to as the GENCODE reference gene 
set (Supplementary Table 1, section U). This includes 20,687
protein-coding genes (GENCODE annotation, v7) with, on average, 
6.3 alternatively spliced transcripts (3.9 different protein-coding tran- 
scripts) per locus. In total, GENCODE-annotated exons of protein- 
coding genes cover 2.94% of the genome or 1.22% for protein-coding 
exons. Protein-coding genes span 33.45% from the outermost start to 
stop codons, or 39.54% from promoter to poly(A) site. Analysis of 
mass spectrometry data from K562 and GM12878 cell lines yielded 57 
confidently identified unique peptide sequences in intergenic regions 
relative to GENCODE annotation. Taken together with evidence of 
pervasive genome transcription, these data indicate that additional
protein-coding genes remain to be found. 

In addition, we annotated 8,801 automatically derived small RNAs 
and 9,640 manually curated long non-coding RNA (lncRNA) loci. 
Comparing lncRNAs to other ENCODE data indicates that lncRNAs 
are generated through a pathway similar to that for protein-coding 
genes. The GENCODE project also annotated 11,224 pseudogenes, 
of which 863 were transcribed and associated with active chromatin. 


RNA 

We sequenced RNA from different cell lines and multiple subcellular 
fractions to develop an extensive RNA expression catalogue. Using a 
conservative threshold to identify regions of RNA activity, 62% of 
genomic bases are reproducibly represented in sequenced long (>200 
nucleotides) RNA molecules or GENCODE exons. Of these bases, only 
5.5% are explained by GENCODE exons. Most transcribed bases are 
within or overlapping annotated gene boundaries (that is, intronic), and 
only 31% of bases in sequenced transcripts were intergenic. 

We used CAGE-seq (5' cap-targeted RNA isolation and sequencing) 
to identify 62,403 transcription start sites (TSSs) at high confidence 
(IDR of 0.01) in tier 1 and 2 cell types. Of these, 27,362 (44%) are within 
100 base pairs (bp) of the 5’ end of a GENCODE-annotated transcript 
or previously reported full-length messenger RNA. The remaining 
regions predominantly lie across exons and 3’ untranslated regions 
(UTRs), and some exhibit cell-type-restricted expression; these may 
represent the start sites of novel, cell-type-specific transcripts. 
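
The proximity comparison described above can be sketched as follows. This is a minimal illustration only, assuming TSS positions and annotated 5’ ends are integer coordinates on a single chromosome and strand; the function name and the input layout are hypothetical and are not taken from the ENCODE pipelines.

```python
from bisect import bisect_left


def fraction_near_annotation(tss_positions, annotated_5p_ends, max_dist=100):
    """Fraction of TSSs lying within max_dist bp of any annotated 5' end.

    Assumption: all positions share one chromosome and strand. Real data
    would be grouped by chromosome and strand before calling this.
    """
    ends = sorted(annotated_5p_ends)
    if not tss_positions or not ends:
        return 0.0
    near = 0
    for pos in tss_positions:
        i = bisect_left(ends, pos)
        # The nearest annotated end is just left or just right of pos.
        candidates = ends[max(i - 1, 0):i + 1]
        if min(abs(pos - e) for e in candidates) <= max_dist:
            near += 1
    return near / len(tss_positions)


# Counts reported in the text: 27,362 of 62,403 high-confidence TSSs.
reported_fraction = 27362 / 62403
```

Sorting the annotated ends once and using a binary search keeps the comparison fast when there are tens of thousands of TSSs.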

Finally, we saw a significant proportion of coding and non-coding 
transcripts processed into steady-state stable RNAs shorter than 200 
nucleotides. These precursors include transfer RNA, microRNA, 
small nuclear RNA and small nucleolar RNA (tRNA, miRNA, 
snRNA and snoRNA, respectively) and the 5’ termini of these processed 
products align with the capped 5’ end tags. 

Table 1 | Summary of transcription factor classes analysed in ENCODE 

Acronym   Description                                                     Factors analysed 
ChromRem  ATP-dependent chromatin complexes                               5 
DNARep    DNA repair                                                      3 
HISase    Histone acetylation, deacetylation or methylation complexes     8 
Other     Cyclin kinase associated with transcription                     1 
Pol2      Pol II subunit                                                  1 (2 forms) 
Pol3      Pol III-associated                                              6 
TFNS      General Pol II-associated factor, not site-specific             8 
TFSS      Pol II transcription factor with sequence-specific DNA binding  87 


Protein bound regions 

To identify regulatory regions directly, we mapped the binding loca- 
tions of 119 different DNA-binding proteins and a number of RNA 
polymerase components in 72 cell types using ChIP-seq (Table 1, 
Supplementary Table 1, section N, and ref. 19); 87 (73%) were 
sequence-specific transcription factors. Overall, 636,336 binding 
regions covering 231 megabases (Mb; 8.1%) of the genome are 
enriched for regions bound by DNA-binding proteins across all cell 
types. We assessed each protein-binding site for enrichment of known 
DNA-binding motifs and the presence of novel motifs. Overall, 86% 
of the DNA segments occupied by sequence-specific transcription 
factors contained a strong DNA-binding motif, and in most (55%) 
cases the known motif was most enriched (P. Kheradpour and 
M. Kellis, manuscript in preparation). 
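
Coverage figures such as "231 Mb (8.1%) of the genome" count each base once, however many binding regions overlap it. A minimal sketch of that union calculation, assuming half-open integer intervals on one chromosome (the function names and the interval representation are illustrative assumptions, not the project's actual code):

```python
def covered_bases(intervals):
    """Total bases covered by the union of half-open [start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_fraction(intervals, genome_size):
    """Fraction of a genome (or chromosome) of genome_size bases covered."""
    return covered_bases(intervals) / genome_size
```

For a whole genome, the same function would be applied per chromosome and the covered bases summed before dividing by the total genome length.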

Protein-binding regions lacking high or moderate affinity cognate 
recognition sites have 21% lower median scores by rank than regions 
with recognition sequences (Wilcoxon rank sum P value <10^-16). 
Eighty-two per cent of the low-signal regions have high-affinity recog- 
nition sequences for other factors. In addition, when ChIP-seq peaks 
are ranked by their concordance with their known recognition 
sequence, the median DNase I accessibility is twofold higher in the 
bottom 20% of peaks than in the upper 80% (genome structure 
correction (GSC) P value <10^-16), consistent with previous 
observations. We speculate that low signal regions are either 
lower-affinity sites or indirect transcription-factor target regions 
associated through interactions with other factors (see also refs 25, 26). 

We organized all the information associated with each transcrip- 
tion factor—including the ChIP-seq peaks, discovered motifs and 
associated histone modification patterns—in FactorBook (http://www. 
factorbook.org; ref. 26), a public resource that will be updated as the 
project proceeds. 


DNase I hypersensitive sites and footprints 

Chromatin accessibility characterized by DNase I hypersensitivity is 
the hallmark of regulatory DNA regions. We mapped 2.89 million 
unique, non-overlapping DNase I hypersensitive sites (DHSs) by 
DNase-seq in 125 cell types, the overwhelming majority of which lie 
distal to TSSs. We also mapped 4.8 million sites across 25 cell types 
that displayed reduced nucleosomal crosslinking by FAIRE, many of 
which coincide with DHSs. In addition, we used micrococcal nuclease 
to map nucleosome occupancy in GM12878 and K562 cells. 

In tier 1 and tier 2 cell types, we identified a mean of 205,109 DHSs 
per cell type (at false discovery rate (FDR) 1%), encompassing an 
average of 1.0% of the genomic sequence in each cell type, and 3.9% 
in aggregate. On average, 98.5% of the occupancy sites of transcription 
factors mapped by ENCODE ChIP-seq (and, collectively, 94.4% of all 
1.1 million transcription factor ChIP-seq peaks in K562 cells) lie within 
accessible chromatin defined by DNase I hotspots. However, a 
small number of factors, most prominently heterochromatin-bound 
repressive complexes (for example, the TRIM28-SETDB1-ZNF274 
complex encoded by the TRIM28, SETDB1 and ZNF274 genes), 
seem to occupy a significant fraction of nucleosomal sites. 

Using genomic DNase I footprinting on 41 cell types we identified 
8.4 million distinct DNase I footprints (FDR 1%). Our de novo 
motif discovery on DNase I footprints recovered ~90% of known 
transcription factor motifs, together with hundreds of novel evolutio- 
narily conserved motifs, many displaying highly cell-selective occu- 
pancy patterns similar to major developmental and tissue-specific 
regulators. 


Regions of histone modification 

We assayed chromosomal locations for up to 12 histone modifications 
and variants in 46 cell types, including a complete matrix of eight 
modifications across tier 1 and tier 2. Because modification states 
may span multiple nucleosomes, which themselves can vary in position 
across cell populations, we used a continuous signal measure of histone 
modifications in downstream analysis, rather than calling regions 
(M. M. Hoffman et al., manuscript in preparation; see http://code. 
google.com/p/align2rawsignal/). For the strongest, ‘peak-like’ histone 
modifications, we used MACS to characterize enriched sites. Table 2 
describes the different histone modifications, their peak characteristics, 
and a summary of their known roles (reviewed in refs 36-39). 

Our data show that global patterns of modification are highly vari- 
able across cell types, in accordance with changes in transcriptional 
activity. Consistent with previous studies, we find that integration 
of the different histone modification information can be used system- 
atically to assign functional attributes to genomic regions (see below). 


DNA methylation 

Methylation of cytosine, usually at CpG dinucleotides, is involved in 
epigenetic regulation of gene expression. Promoter methylation is 
typically associated with repression, whereas genic methylation correlates 
with transcriptional activity. We used reduced representation 
bisulphite sequencing (RRBS) to profile DNA methylation quantita- 
tively for an average of 1.2 million CpGs in each of 82 cell lines and 
tissues (8.6% of non-repetitive genomic CpGs), including CpGs in 
intergenic regions, proximal promoters and intragenic regions (gene 
bodies), although it should be noted that the RRBS method preferentially 
targets CpG-rich islands. We found that 96% of CpGs 
exhibited differential methylation in at least one cell type or tissue 
assayed (K. Varley et al., personal communication), and levels of 
DNA methylation correlated with chromatin accessibility. The most 
variably methylated CpGs are found more often in gene bodies and 
intergenic regions, rather than in promoters and upstream regulatory 
regions. In addition, we identified an unexpected correspondence 
between unmethylated genic CpG islands and binding by P300, a 
histone acetyltransferase linked to enhancer activity. 


Histone modification  Signal           Putative functions 
or variant            characteristics 
H2A.Z                 Peak             Histone protein variant (H2A.Z) associated with regulatory elements with dynamic chromatin 
H3K4me1               Peak/region      Mark of regulatory elements associated with enhancers and other distal elements, but also enriched downstream of transcription starts 
H3K4me2               Peak             Mark of regulatory elements associated with promoters and enhancers 
H3K4me3               Peak             Mark of regulatory elements primarily associated with promoters/transcription starts 
H3K9ac                Peak             Mark of active regulatory elements with preference for promoters 
H3K9me1               Region           Preference for the 5’ end of genes 
H3K9me3               Peak/region      Repressive mark associated with constitutive heterochromatin and repetitive elements 
H3K27ac               Peak             Mark of active regulatory elements; may distinguish active enhancers and promoters from their inactive counterparts 
H3K27me3              Region           Repressive mark established by polycomb complex activity associated with repressive domains and silent developmental genes 
H3K36me3              Region           Elongation mark associated with transcribed portions of genes, with preference for 3’ regions after intron 1 
H3K79me2              Region           Transcription-associated mark, with preference for 5’ end of genes 
H4K20me1              Region           Preference for 5’ end of genes 

Because RRBS is a sequence-based assay with single-base resolu- 
tion, we were able to identify CpGs with allele-specific methylation 
consistent with genomic imprinting, and determined that these loci 
exhibit aberrant methylation in cancer cell lines (K. Varley et al., 
personal communication). Furthermore, we detected reproducible 
cytosine methylation outside CpG dinucleotides in adult tissues, 
providing further support that this non-canonical methylation event 
may have important roles in human biology (K. Varley et al., personal 
communication). 


Chromosome-interacting regions 

Physical interaction between distinct chromosome regions that can be 
separated by hundreds of kilobases is thought to be important in the 
regulation of gene expression. We used two complementary chromosome 
conformation capture (3C)-based technologies to probe 
these long-range physical interactions. 

A 3C-carbon copy (5C) approach provided unbiased detection 
of long-range interactions with TSSs in a targeted 1% of the genome 
(the 44 ENCODE pilot regions) in four cell types (GM12878, K562, 
HeLa-S3 and H1 hESC). We discovered hundreds of statistically 
significant long-range interactions in each cell type after accounting 
for chromatin polymer behaviour and experimental variation. Pairs 
of interacting loci showed strong correlation between the gene 
expression level of the TSS and the presence of specific functional 
element classes such as enhancers. The average number of distal ele- 
ments interacting with a TSS was 3.9, and the average number of TSSs 
interacting with a distal element was 2.5, indicating a complex net- 
work of interconnected chromatin. Such interwoven long-range 
architecture was also uncovered genome-wide using chromatin interaction 
analysis with paired-end tag sequencing (ChIA-PET) applied 
to identify interactions in chromatin enriched by RNA polymerase II 
(Pol II) ChIP from five cell types. In K562 cells, we identified 127,417 
promoter-centred chromatin interactions using ChIA-PET, 98% of 
which were intra-chromosomal. Whereas promoter regions of 2,324 
genes were involved in ‘single-gene’ enhancer-promoter interactions, 
those of 19,813 genes were involved in ‘multi-gene’ interaction complexes 
spanning up to several megabases, including promoter-promoter 
and enhancer-promoter interactions. 
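
Averages such as those reported for the 5C data (3.9 distal elements per TSS and 2.5 TSSs per distal element) are degree averages on a bipartite interaction graph. The sketch below shows one way to compute them, assuming each significant interaction is recorded as a (TSS identifier, distal element identifier) pair and that only loci with at least one interaction enter the averages; both are assumptions made for illustration, not details stated in the text.

```python
from collections import defaultdict


def mean_partners(interactions):
    """Return (mean distal elements per TSS, mean TSSs per distal element).

    interactions: iterable of (tss_id, distal_id) pairs. Duplicate pairs
    are counted once. Loci with no interactions are not represented.
    """
    tss_to_distal = defaultdict(set)
    distal_to_tss = defaultdict(set)
    for tss, distal in interactions:
        tss_to_distal[tss].add(distal)
        distal_to_tss[distal].add(tss)
    if not tss_to_distal:
        return 0.0, 0.0
    per_tss = sum(len(v) for v in tss_to_distal.values()) / len(tss_to_distal)
    per_distal = sum(len(v) for v in distal_to_tss.values()) / len(distal_to_tss)
    return per_tss, per_distal
```

The two averages differ whenever the numbers of interacting TSSs and interacting distal elements differ, because the same set of edges is divided by two different node counts.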

These analyses portray a complex landscape of long-range gene- 
element connectivity across ranges of hundreds of kilobases to several 
megabases, including interactions among unrelated genes (Supplemen- 
tary Fig. 1, section Y). Furthermore, in the 5C results, 50-60% of long- 
range interactions occurred in only one of the four cell lines, indicative 
of a high degree of tissue specificity for gene-element connectivity. 


Summary of ENCODE-identified elements 

Accounting for all these elements, a surprisingly large amount of the 
human genome, 80.4%, is covered by at least one ENCODE-identified 
element (detailed in Supplementary Table 1, section Q). The broadest 
element class represents the different RNA types, covering 62% of the 
genome (although the majority is inside of introns or near genes). 
Regions highly enriched for histone modifications form the next 
largest class (56.1%). Excluding RNA elements and broad histone 
elements, 44.2% of the genome is covered. Smaller proportions of 
the genome are occupied by regions of open chromatin (15.2%) or 
sites of transcription factor binding (8.1%), with 19.4% covered by at 
least one DHS or transcription factor ChIP-seq peak across all cell 
lines. Using our most conservative assessment, 8.5% of bases are 
covered by either a transcription-factor-binding-site motif (4.6%) 
or a DHS footprint (5.7%). This, however, is still about 4.5-fold higher 
than the amount of protein-coding exons, and about twofold higher 
than the estimated amount of pan-mammalian constraint. 

Given that the ENCODE project did not assay all cell types, or all 
transcription factors, and in particular has sampled few specialized or 
developmentally restricted cell lineages, these proportions must be 
underestimates of the total amount of functional bases. However, 
many assays were performed on more than one cell type, allowing 
assessment of the rate of discovery of new elements. For both DHSs 
and CTCF-bound sites, the number of new elements initially increases 
rapidly with a steep gradient for the saturation curve and then slows 
with increasing number of cell types (Supplementary Figs 1 and 2, 
section R). With the current data, at the flattest part of the saturation 
curve each new cell type adds, on average, 9,500 DHS elements (across 
106 cell types) and 500 CTCF-binding elements (across 49 cell types), 
representing 0.45% of the total element number. We modelled 
saturation for the DHSs and CTCF-binding sites using a Weibull 
distribution (r^2 > 0.999) and predict saturation at approximately 
4.1 million (standard error (s.e.) = 108,000) and 185,100 (s.e. = 18,020) 
sites, respectively, indicating that we have discovered around half of the 
estimated total DHSs. These estimates represent a lower bound, but 
reinforce the observation that there is more non-coding functional 
DNA than either coding sequence or mammalian evolutionarily con- 
strained bases. 
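
To illustrate what a Weibull-shaped saturation curve looks like, the sketch below assumes the cumulative form N(x) = N_max (1 - exp(-(x / scale)^shape)), where x is the number of cell types assayed. The exact parametrization used for the fit is not given in the text, so this form, the function names and any parameter values passed to them are assumptions for illustration only.

```python
import math


def weibull_saturation(n_cell_types, n_max, scale, shape):
    """Expected number of distinct elements after n_cell_types assays.

    Assumed form: n_max * (1 - exp(-(n / scale) ** shape)).
    n_max is the saturation level; scale and shape control how fast the
    curve approaches it.
    """
    return n_max * (1.0 - math.exp(-((n_cell_types / scale) ** shape)))


def marginal_gain(n_cell_types, n_max, scale, shape):
    """New elements contributed by adding one more cell type."""
    return (weibull_saturation(n_cell_types + 1, n_max, scale, shape)
            - weibull_saturation(n_cell_types, n_max, scale, shape))
```

Under this form the curve starts at zero, rises steeply, and flattens towards N_max, which matches the qualitative behaviour described for the DHS and CTCF saturation curves; the per-cell-type gain at the flat end of the curve corresponds to `marginal_gain` evaluated at a large number of cell types.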


The impact of selection on functional elements 

From comparative genomic studies, at least 3-8% of bases are under 
purifying (negative) selection, indicating that these bases may 
potentially be functional. We previously found that 60% of mammalian 
evolutionarily constrained bases were annotated in the ENCODE pilot 
project, but also observed that many functional elements lacked 
evidence of constraint, a conclusion substantiated by others. The 
diversity and genome-wide occurrence of functional elements now 
identified provides an unprecedented opportunity to examine further 
the forces of negative selection on human functional sequences. 

We examined negative selection using two measures that highlight 
different periods of selection in the human genome. The first measure, 
inter-species, pan-mammalian constraint (GERP-based scores; 
24 mammals), addresses selection during mammalian evolution. 
The second measure is intra-species constraint estimated from the 
numbers of variants discovered in human populations using data from 
the 1000 Genomes project, and covers selection over human evolution. 
In Fig. 1, we plot both these measures of constraint for different 
classes of identified functional elements, excluding features overlapping 
exons and promoters that are known to be constrained. Each graph also 
shows genomic background levels and measures of coding-gene con- 
straint for comparison. Because we plot human population diversity on 
an inverted scale, elements that are more constrained by negative selec- 
tion will tend to lie in the upper and right-hand regions of the plot. 

For DNase I elements (Fig. 1b) and bound motifs (Fig. 1c), most 
sets of elements show enrichment in pan-mammalian constraint and 
decreased human population diversity, although for some cell types 
the DNase I sites do not seem overall to be subject to pan-mammalian 
constraint. Bound transcription factor motifs have a natural control 
from the set of transcription factor motifs with equal sequence poten- 
tial for binding but without binding evidence from ChIP-seq experi- 
ments—in all cases, the bound motifs show both more mammalian 
constraint and higher suppression of human diversity. 

Consistent with previous findings, we do not observe genome-wide 
evidence for pan-mammalian selection of novel RNA sequences 
(Fig. 1d). There are also a large number of elements without mammalian 
constraint, between 17% and 90% for transcription-factor-binding 
regions as well as DHSs and FAIRE regions. Previous studies could 
not determine whether these sequences are either biochemically active, 
but with little overall impact on the organism, or under lineage-specific 
selection. By isolating sequences preferentially inserted into 
the primate lineage, which is only feasible given the genome-wide scale 
of this data, we are able to examine this issue specifically. Most 
primate-specific sequence is due to retrotransposon activity, but an appreciable 
proportion is non-repetitive primate-specific sequence. Of 104,343,413 
primate-specific bases (excluding repetitive elements), 67,769,372 
(65%) are found within ENCODE-identified elements. Examination 
of 227,688 variants segregating in these primate-specific regions 
revealed that all classes of elements (RNA and regulatory) show 
depressed derived allele frequencies, consistent with recent negative 
selection occurring in at least some of these regions (Fig. 1e). An 
alternative approach examining sequences that are not clearly under 
pan-mammalian constraint showed a similar result (L. Ward and 
M. Kellis, manuscript submitted). This indicates that an appreciable 
proportion of the unconstrained elements are lineage-specific elements 
required for organismal function, consistent with long-standing views 
of recent evolution, and the remainder are probably ‘neutral’ elements 
that are not currently under selection but may still affect cellular or 
larger scale phenotypes without an effect on fitness. 


Figure 1 | Impact of selection on ENCODE functional elements in 
mammals and human populations. a, Levels of pan-mammalian constraint 
(mean GERP score; 24 mammals, x axis) compared to diversity, a measure of 
negative selection in the human population (mean expected heterozygosity, 
inverted scale, y axis) for ENCODE data sets. Each point is an average for a 
single data set. The top-right corners have the strongest evolutionary constraint 
and lowest diversity. Coding (C), UTR (U), genomic (G), intergenic (IG) and 
intronic (IN) averages are shown as filled squares. In each case the vertical and 
horizontal cross hairs show representative levels for the neutral expectation for 
mammalian conservation and human population diversity, respectively. The 
spread over all non-exonic ENCODE elements greater than 2.5 kb from TSSs is 
shown. The inner dashed box indicates that parts of the plot have been 
magnified for the surrounding outer panels, although the scales in the outer 
plots provide the exact regions and dimensions magnified. The spread for DHS 
sites (b) and RNA elements (d) is shown in the plots on the left. RNA elements 
are either long novel intronic (dark green) or long intergenic (light green) 
RNAs. The horizontal cross hairs are colour-coded to the relevant data set in 
d. c, Spread of transcription factor motif instances either in regions bound by 
the transcription factor (orange points) or in the corresponding unbound motif 
matches in grey, with bound and unbound points connected with an arrow in 
each case showing that bound sites are generally more constrained and less 
diverse. e, Derived allele frequency spectrum for primate-specific elements, 
with variations outside ENCODE elements in black and variations covered by 
ENCODE elements in red. The increase in low-frequency alleles compared to 
background is indicative of negative selection occurring in the set of variants 
annotated by the ENCODE data. f, Aggregation of mammalian constraint 
scores over the glucocorticoid receptor (GR) transcription factor motif in 
bound sites, showing the expected correlation with the information content of 
bases in the motif. An interactive version of this figure is available in the online 
version of the paper. 


The binding patterns of transcription factors are not uniform, and 
we can correlate both inter- and intra-species measures of negative 
selection with the overall information content of motif positions. The 
selection on some motif positions is as high as protein-coding exons 
(Fig. 1f; L. Ward and M. Kellis, manuscript submitted). These 
aggregate measures across motifs show that the binding preferences 
found in the population of sites are also relevant to the per-site beha- 
viour. By developing a per-site metric of population effect on bound 
motifs, we found that highly constrained bound instances across 
mammals are able to buffer the impact of individual variation. 


ENCODE data integration with known genomic features 
Promoter-anchored integration 

Many of the ENCODE assays directly or indirectly provide informa- 
tion about the action of promoters. Focusing on the TSSs of protein- 
coding transcripts, we investigated the relationships between different 
ENCODE assays, in particular testing the hypothesis that RNA 
expression (output) can be effectively predicted from patterns of 
chromatin modification or transcription factor binding (input). 
Consistent with previous reports, we observe two relatively distinct 
types of promoter: (1) broad, mainly (C+G)-rich, TATA-less promoters; 
and (2) narrow, TATA-box-containing promoters. These promoters 
have distinct patterns of histone modifications, and transcription-fac- 
tor-binding sites are selectively enriched in each class (Supplementary 
Fig. 1, section Z). 

We developed predictive models to explore the interaction between 
histone modifications and measures of transcription at promoters, 
distinguishing between modifications known to be added as a con- 
sequence of transcription (such as H3K36me3 and H3K79me2) and 
other categories of histone marks. In our analyses, the best models 
had two components: an initial classification component (on/off) and a 
second quantitative model component. Our models showed that 
activating acetylation marks (H3K27ac and H3K9ac) are roughly 
as informative as activating methylation marks (H3K4me3 and 
H3K4me2) (Fig. 2a). Although repressive marks, such as H3K27me3 
or H3K9me3, show negative correlation both individually and in the 
model, removing these marks produces only a small reduction in 
model performance. However, for a subset of promoters in each cell 
line, repressive histone marks (H3K27me3 or H3K9me3) must be used 
to predict their expression accurately. We also examined the interplay 
between the H3K79me2 and H3K36me3 marks, both of which mark 
gene bodies, probably reflecting recruitment of modification enzymes 
by polymerase isoforms. As described previously, H3K79me2 occurs 
preferentially at the 5’ ends of gene bodies and H3K36me3 occurs 
more 3’, and our analyses support the previous model in which the 
H3K79me2 to H3K36me3 transition occurs at the first 3’ splice site. 
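
The two-component structure described above (an on/off classification step followed by a quantitative step for expressed promoters) can be sketched in a deliberately simplified form. The code below is a stand-in, not the models used in the study: it assumes a single hypothetical chromatin-signal feature per promoter, uses a threshold classifier for the first stage and an ordinary least-squares line for the second, and all names are illustrative.

```python
def fit_two_stage(signal, expression, on_cutoff=0.0):
    """Fit a threshold classifier (on/off) plus a line for 'on' promoters.

    signal: one chromatin-signal value per promoter (hypothetical feature).
    expression: observed expression; values <= on_cutoff count as 'off'.
    Returns (threshold, slope, intercept).
    """
    labels = [y > on_cutoff for y in expression]
    if not any(labels):
        raise ValueError("no expressed promoters to fit the second stage")
    # Stage 1: pick the signal threshold with the best training accuracy.
    best_t, best_acc = None, -1.0
    for t in sorted(set(signal)):
        acc = sum((x >= t) == lab for x, lab in zip(signal, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # Stage 2: least-squares line fitted only to promoters that are 'on'.
    xs = [x for x, lab in zip(signal, labels) if lab]
    ys = [y for y, lab in zip(expression, labels) if lab]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return best_t, slope, my - slope * mx


def predict_two_stage(model, x):
    """Predict expression: 0.0 if classified 'off', else the fitted line."""
    threshold, slope, intercept = model
    return slope * x + intercept if x >= threshold else 0.0
```

Separating the two stages keeps the many silent promoters from dominating the quantitative fit, which is one plausible motivation for a model of this shape.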

Few previous studies have attempted to build qualitative or quant- 
itative models of transcription genome-wide from transcription 
factor levels because of the paucity of documented transcription- 
factor-binding regions and the lack of coordination around a single 
cell line. We thus examined the predictive capacity of transcription- 
factor-binding signals for the expression levels of promoters (Fig. 2b). 

Figure 2 | Modelling transcription levels from histone modification and 
transcription-factor-binding patterns. a, b, Correlative models between 
either histone modifications or transcription factors, respectively, and RNA 
production as measured by CAGE tag density at TSSs in K562 cells. In each case 
the scatter plot shows the output of the correlation models (x axis) compared to 
observed values (y axis). The bar graphs show the most important histone 
modifications (a) or transcription factors (b) in both the initial classification 
phase (top bar graph) and the quantitative regression phase (bottom bar graph), 
with larger values indicating increasing importance of the variable in the model. 
Further analysis of other cell lines and RNA measurement types is reported 
elsewhere. AUC, area under curve; Gini, Gini coefficient; RMSE, root mean 
square error. 

In contrast to the profiles of histone modifications, most transcription 
factors show enriched binding signals in a narrow DNA region near 
the TSS, with relatively higher binding signals in promoters with 
higher CpG content. Most of this correlation could be recapitulated 
by looking at the aggregate binding of transcription factors without 
specific transcription factor terms. Together, these correlation models 
indicate both that a limited set of chromatin marks are sufficient to 
‘explain’ transcription and that a variety of transcription factors might 
have broad roles in general transcription levels across many genes. It is 
important to note that this is an inherently observational study of 
correlation patterns, and is consistent with a variety of mechanistic 
models with different causal links between the chromatin, transcrip- 
tion factor and RNA assays. However, it does indicate that there is 
enough information present at the promoter regions of genes to 
explain most of the variation in RNA expression. 

We developed predictive models similar to those used to model 
transcriptional activity to explore the relationship between levels of 
histone modification and inclusion of exons in alternately spliced 
transcripts. Even accounting for expression level, H3K36me3 has a 
positive contribution to exon inclusion, whereas H3K79me2 has a 
negative contribution (H. Tilgner et al., manuscript in preparation). 
By monitoring the RNA populations in the subcellular fractions of 
K562 cells, we found that essentially all splicing is co-transcriptional, 
further supporting a link between chromatin structure and splicing. 


Transcription-factor-binding site-anchored integration 
Transcription-factor-binding sites provide a natural focus around 
which to explore chromatin properties. Transcription factors are often 
multifunctional and can bind a variety of genomic loci with different 
combinations and patterns of chromatin marks and nucleosome organ- 
ization. Hence, rather than averaging chromatin mark profiles across all 
binding sites of a transcription factor, we developed a clustering pro- 
cedure, termed the Clustered Aggregation Tool (CAGT), to identify 
subsets of binding sites sharing similar but distinct patterns of chromatin 
mark signal magnitude, shape and hidden directionality. For 
example, the average profile of the repressive histone mark H3K27me3 
over all 55,782 CTCF-binding sites in H1 hESCs shows poor signal 
enrichment (Fig. 3a). However, after grouping profiles by signal 
magnitude we found a subset of 9,840 (17.6%) CTCF-binding sites 
that exhibit significant flanking H3K27me3 signal. Shape and orienta- 
tion analysis further revealed that the predominant signal profile for 
H3K27me3 around CTCF peak summits is asymmetric, consistent 
with a boundary role for some CTCF sites between active and 
polycomb-silenced domains. Further examples are provided in 
Supplementary Figs 5 and 6 of section E. For TAF1, predominantly 
found near TSSs, the asymmetric sites are orientated with the direction 
of transcription. However, for distal sites, such as those bound by 
GATA1 and CTCF, we also observed a high proportion of asymmetric 
histone patterns, although independent of motif directionality. In fact, 
all transcription-factor-binding data sets in all cell lines show 
predominantly asymmetric patterns (asymmetry ratio >0.6) for all 
chromatin marks but not for DNase I signal (Fig. 3b). This indicates 
that most transcription-factor-bound chromatin events correlate with 
structured, directional patterns of histone modifications, and that pro- 
moter directionality is not the only source of orientation at these sites. 
We also examined nucleosome occupancy relative to the symmetry 
properties of chromatin marks around transcription-factor-binding 
sites. Around TSSs, there is usually strong asymmetric nucleosome 
occupancy, often accounting for most of the histone modification 
signal (for instance, see Supplementary Fig. 4, section E). However, 
away from TSSs, there is far less concordance. For example, CTCF- 
binding sites typically show arrays of well-positioned nucleosomes on 
either side of the peak summit (Supplementary Fig. 1, section E). 
Where the flanking chromatin mark signal is high, the signals are 
often asymmetric, indicating differential marking with histone 
modifications (Supplementary Figs 2 and 3, section E). Thus, we 
confirm on a genome-wide scale that transcription factors can form 
barriers around which nucleosomes and histone modifications are 
arranged in a variety of configurations. This is explored in further 
detail in refs 25, 26 and 30. 


Figure 3 | Patterns and asymmetry of chromatin modification at 
transcription-factor-binding sites. a, Results of clustered aggregation of 
H3K27me3 modification signal around CTCF-binding sites (a multifunctional 
protein involved with chromatin structure). The first three plots (left column) 
show the signal behaviour of the histone modification over all sites (top) and 
then split into the high and low signal components. The solid lines show the 
mean signal distribution by relative position with the blue shaded area 
delimiting the tenth and ninetieth percentile range. The high signal component 
is then decomposed further into six different shape classes on the right (see ref. 
30 for details). The shape decomposition process is strand aware. b, Summary 
of shape asymmetry for DNase I, nucleosome and histone modification signals 
by plotting an asymmetry ratio for each signal over all transcription-factor- 
binding sites. All histone modifications measured in this study show 
predominantly asymmetric patterns at transcription-factor-binding sites. An 
interactive version of this figure is available in the online version of the paper. 


Transcription factor co-associations 

Transcription-factor-binding regions are nonrandomly distributed 
across the genome, with respect to both other features (for example, 
promoters) and other transcription-factor-binding regions. Within the 
tier 1 and 2 cell lines, we found 3,307 pairs of statistically co-associated 
factors (P < 1 × 10^-16, GSC) involving 114 out of a possible 117 factors 
(97%) (Fig. 4a). These include expected associations, such as Jun and 
Fos, and some less expected novel associations, such as TCF7L2 with 
HNF4-α and FOXA2 (ref. 66; a full listing is given in Supplementary 
Table 1, section F). When one considers promoter and intergenic 

Figure 4 | Co-association between transcription factors. a, Significant 
co-associations of transcription factor pairs using the GSC statistic across 
the entire genome in K562 cells. The colour strength represents the extent of 
association (from red (strongest), orange, to yellow (weakest)), whereas the 
depth of colour represents the fit to the GSC model (where white indicates 
that the statistical model is not appropriate) as indicated by the key. Most 
transcription factors have a nonrandom association to other transcription 
factors, and these associations are dependent on the genomic context, meaning 
that once the genome is separated into promoter proximal and distal regions, 
the overall levels of co-association decrease, but more specific relationships 
are uncovered. b, Three classes of behaviour are shown. The first column shows 
a set of associations for which strength is independent of location in 
promoter and distal regions, whereas the second column shows a set of 
transcription factors that have stronger associations in promoter-proximal 
regions. Both of these examples are from data in K562 cells and are 
highlighted on the genome-wide co-association matrix (a) by the labelled boxes 
A and B, respectively. The third column shows a set of transcription factors 
that show stronger association in distal regions (in the H1 hESC line). An 
interactive version of this figure is available in the online version of the 
paper. 

Table 3 | Summary of the combined state types 

Label | Description | Details* | Colour 
CTCF | CTCF-enriched element | Sites of CTCF signal lacking histone modifications, often associated with open chromatin. Many probably have a function in insulator assays, but because of the multifunctional nature of CTCF, we are conservative in our description. Also enriched for the cohesin components RAD21 and SMC3; CTCF is known to recruit the cohesin complex. | Turquoise 
E | Predicted enhancer | Regions of open chromatin associated with H3K4me1 signal. Enriched for other enhancer-associated marks, including transcription factors known to act at enhancers. In enhancer assays, many of these (>50%) function as enhancers. A more conservative alternative would be cis-regulatory regions. Enriched for sites for the proteins encoded by EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 genes in K562 cells. Have nuclear and whole-cell RNA signal, particularly poly(A)− fraction. | 
PF | Predicted promoter flanking region | Regions that generally surround TSS segments (see below). | 
R | Predicted repressed or low-activity region | This is a merged state that includes H3K27me3 polycomb-enriched regions, along with regions that are silent in terms of observed signal for the input assays to the segmentations (low or no signal). They may have other signals (for example, RNA, not in the segmentation input data). Enriched for sites for the proteins encoded by REST and some other factors (for example, proteins encoded by BRF2, CEBPB, MAFK, TRIM28, ZNF274 and SETDB1 genes in K562 cells). | 
TSS | Predicted promoter region including TSS | Found close to or overlapping GENCODE TSS sites. High precision/recall for TSSs. Enriched for H3K4me3. Sites of open chromatin. Enriched for transcription factors known to act close to promoters and polymerases Pol II and Pol III. Short RNAs are most enriched in these segments. | 
T | Predicted transcribed region | Overlap gene bodies with H3K36me3 transcriptional elongation signal. Enriched for phosphorylated form of Pol II signal (elongating polymerase) and poly(A)+ RNA, especially cytoplasmic. | 
WE | Predicted weak enhancer or open chromatin cis-regulatory element | Similar to the E state, but weaker signals and weaker enrichments. | 

* Where specific enrichments or overlaps are identified, these are derived from analysis in GM12878 and/or K562 cells where the data for comparison is richest. The colours in 
display of these tracks from the ENCODE data hub. 

Figure 5 | Integration of ENCODE data by genome-wide segmentation. 
a, Illustrative region with the two segmentation methods (ChromHMM and 
Segway) in a dense view and the combined segmentation expanded to show 
each state in GM12878 cells, beneath a compressed view of the GENCODE 
gene annotations. Note that at this level of zoom and genome browser 
resolution, some segments appear to overlap although they do not. 
Segmentation classes are named and coloured according to the scheme in 
Table 3. Beneath the segmentations are shown each of the normalized signals 
that were used as the input data for the segmentations. Open chromatin signals 
from DNase-seq from the University of Washington group (UW DNase) or the 
ENCODE open chromatin group (Openchrom DNase) and FAIRE assays are 
shown in blue; signal from histone modification ChIP-seq in red; and 
transcription factor ChIP-seq signal for Pol II and CTCF in green. The mauve 
ChIP-seq control signal (input control) at the bottom was also included as an 
input to the segmentation. b, Association of selected transcription factor (left) 
and RNA (right) elements in the combined segmentation states (x axis) 
expressed as an observed/expected ratio (obs./exp.) for each combination of 
transcription factor or RNA element and segmentation class using the 
heat-map scale shown in the key beside each heat map. c, Variability of states 
between cell lines, showing the distribution of occurrences of the state in the six 
cell lines at specific genome locations: from unique to one cell line to ubiquitous 
in all six cell lines for five states (CTCF, E, T, TSS and R). d, Distribution of 
methylation level at individual sites from RRBS analysis in GM12878 cells 
across the different states, showing the expected hypomethylation at TSSs and 
hypermethylation of gene bodies (T state) and repressed (R) regions. 

regions separately, this changes to 3,201 pairs (116 factors, 99%) for 
promoters and 1,564 pairs (108 factors, 92%) for intergenic regions, 
with some associations more specific to these genomic contexts (for 
example, the cluster of HDAC2, GABPA, CHD2, GTF2F1, MXI1 and 
MYC in promoter regions and SP1, EP300, HDAC2 and NANOG in 
intergenic regions (Fig. 4b)). These general and context-dependent 
associations lead to a network representation of the co-binding with 
many interesting properties, explored in refs 19, 25 and 26. In addition, 
we also identified a set of regions bound by multiple factors 
representing high occupancy of transcription factor (HOT) regions. 


Genome-wide integration 


To identify functional regions genome-wide, we next integrated ele- 
ments independent of genomic landmarks using either discriminative 
training methods, where a subset of known elements of a particular class 
were used to train a model that was then used to discover more instances 
of this class, or using methods in which only data from ENCODE assays 
were used without explicit knowledge of any annotation. 

For discriminative training, we used a three-step process to predict 
potential enhancers, described in Supplementary Information and 
ref. 67. Two alternative discriminative models converged on a set of 
~13,000 putative enhancers in K562 cells (ref. 67). In the second approach, 
two methodologically distinct unbiased approaches (see refs 40, 68 
and M. M. Hoffman et al., manuscript in preparation) converged on a 
concordant set of histone modification and chromatin-accessibility 
patterns that can be used to segment the genome in each of the tier 1 
and tier 2 cell lines, although the individual loci in each state in each 
cell line are different. With the exception of RNA polymerase II and 
CTCF, the addition of transcription factor data did not substantially 
alter these patterns. At this stage, we deliberately excluded RNA and 
methylation assays, reserving these data as a means to validate the 
segmentations. 

Our integration of the two segmentation methods (M. M. Hoffman 
et al., manuscript in preparation) established a consensus set of seven 
major classes of genome states, described in Table 3. The standard 
view of active promoters, with a distinct core promoter region (TSS 
and PF states), leading to active gene bodies (T, transcribed state), is 
rediscovered in this model (Fig. 5a, b). There are three ‘active’ distal 
states. We tentatively labelled two as enhancers (predicted enhancers, 
E, and predicted weak enhancers, WE) due to their occurrence in 
regions of open chromatin with high H3K4me1, although they differ 
in the levels of marks such as H3K27ac, currently thought to 
distinguish active from inactive enhancers. The other active state 
(CTCF) has high CTCF binding and includes sequences that function 
as insulators in a transfection assay. The remaining repressed state (R) 
summarizes sequences split between different classes of actively 
repressed or inactive, quiescent chromatin. We found that the 
CTCF-binding-associated state is relatively invariant across cell types, 
with individual regions frequently occupying the CTCF state across all 
six cell types (Fig. 5c). Conversely, the E and T states have substantial 
cell-specific behaviour, whereas the TSS state has a bimodal behaviour 
with similar numbers of cell-invariant and cell-specific occurrences. 
It is important to note that the consensus summary classes do not 
capture all the detail discovered in the individual segmentations con- 
taining more states. 
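
The cell-type variability summarized in Fig. 5c (how many of the six cell lines share a given state at a given location) amounts to a simple tally across aligned segmentations. The sketch below assumes, purely for illustration, that each segmentation has been reduced to one state label per fixed genomic bin; the function name and the three-line toy input are hypothetical.

```python
from collections import Counter

def state_sharing(segmentations, state):
    """For each bin carrying `state` in at least one cell line, count how many
    cell lines share it, and return a histogram {n_cell_lines: n_bins}."""
    tracks = list(segmentations.values())
    histogram = Counter()
    for labels in zip(*tracks):  # labels of one bin across all cell lines
        n_lines = sum(1 for label in labels if label == state)
        if n_lines:
            histogram[n_lines] += 1
    return dict(histogram)

# Hypothetical toy segmentations over five bins in three cell lines.
toy = {
    "GM12878": ["CTCF", "E", "T", "R", "TSS"],
    "K562":    ["CTCF", "R", "T", "R", "TSS"],
    "H1-hESC": ["CTCF", "R", "R", "E", "R"],
}
ctcf_sharing = state_sharing(toy, "CTCF")
enhancer_sharing = state_sharing(toy, "E")
```

In this toy input the CTCF bin is shared by every cell line while each E bin is unique to one, which mirrors the invariant versus cell-specific contrast described in the text.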

The distribution of RNA species across segments is quite distinct, 
indicating that underlying biological activities are captured in the 
segmentation. Polyadenylated RNA is heavily enriched in gene 
bodies. Around promoters, there are short RNA species previously 
identified as promoter-associated short RNAs (Fig. 5b). Similarly, 
DNA methylation shows marked distinctions between segments, 
recapitulating the known biology of predominantly unmethylated 
active promoters (TSS states) followed by methylated gene bodies 
(T state, Fig. 5d). The two enhancer-enriched states show distinct 
patterns of DNA methylation, with the less active enhancer state 
(by H3K27ac/H3K4me1 levels) showing higher methylation. These 
states also have an excess of RNA elements without poly(A) tails and 
methyl-cap RNA, as assayed by CAGE sequences, compared to 
matched intergenic controls, indicating a specific transcriptional 
mode associated with active enhancers. Transcription factors also 
showed distinct distributions across the segments (Fig. 5b). A striking 
pattern is the concentration of transcription factors in the TSS- 
associated state. The enhancers contain a different set of transcription 
factors. For example, in K562 cells, the E state is enriched for binding 
by the proteins encoded by the EP300, FOS, FOSL1, GATA2, HDAC8, 
JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 genes. 
We tested a subset of these predicted enhancers in both mouse and 
fish transgenic models (examples in Fig. 6), with over half of the 
elements showing activity, often in the corresponding tissue type. 
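
The observed/expected ratios used in Fig. 5b to describe how transcription factors distribute across segmentation states can be illustrated with a small calculation. This sketch assumes that the expectation is proportional to the number of bases each state covers; the function name and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def obs_exp(site_states, state_bases):
    """site_states: the state label in which each binding site falls.
    state_bases: bases covered by each state (assumed null: sites land
    in a state in proportion to its coverage)."""
    total_sites = len(site_states)
    total_bases = sum(state_bases.values())
    ratios = {}
    for state, bases in state_bases.items():
        observed = sum(1 for s in site_states if s == state)
        expected = total_sites * bases / total_bases
        ratios[state] = observed / expected
    return ratios

# Hypothetical example: 10 sites, most of them in the smallest state.
site_states = ["TSS"] * 6 + ["E"] * 3 + ["R"] * 1
state_bases = {"TSS": 10, "E": 30, "R": 60}
ratios = obs_exp(site_states, state_bases)
```

A ratio above 1 indicates enrichment of the factor in that state and a ratio below 1 indicates depletion.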

The segmentation provides a linear determination of functional 
state across the genome, but not an association of particular distal 
regions with genes. By using the variation of DNase I signal across cell 
lines, 39% of E (enhancer associated) states could be linked to a 
proposed regulated gene concordant with physical proximity 
patterns determined by 5C or ChIA-PET. 
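
One plausible way to use the variation of DNase I signal across cell lines is to correlate the signal at a distal element with the signal at nearby promoters and keep the best-matching gene. The sketch below is an assumption about the general approach, not the method of the cited work: the Pearson statistic, the 0.7 threshold, the gene names and the signal values are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def link_enhancer(enhancer_signal, promoter_signals, min_r=0.7):
    """Return (gene, r) for the best-correlated promoter, or None below min_r."""
    best = max(
        ((gene, pearson(enhancer_signal, sig)) for gene, sig in promoter_signals.items()),
        key=lambda pair: pair[1],
    )
    return best if best[1] >= min_r else None

# Hypothetical DNase I signal in six cell lines.
enhancer = [9.0, 1.0, 8.0, 0.5, 7.5, 1.5]
promoters = {
    "GENE_A": [8.0, 0.5, 9.0, 1.0, 7.0, 1.0],
    "GENE_B": [1.0, 7.0, 0.5, 8.0, 1.5, 9.0],
}
link = link_enhancer(enhancer, promoters)
```

Elements whose best correlation falls below the threshold are left unlinked, which is consistent with only a fraction of E states being assigned a proposed target gene.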

To provide a fine-grained regional classification, we turned to a 
self-organizing map (SOM) to cluster genome segmentation regions based 
on their assay signal characteristics (Fig. 7). The segmentation regions 
were initially randomly assigned to a 1,350-state map in a two- 
dimensional toroidal space (Fig. 7a). This map can be visualized as 
a two-dimensional rectangular plane onto which the various signal 
distributions can be plotted. For instance, the rectangle at the bottom 
left of Fig. 7a shows the distribution of the genome in the initial 
randomized map. The SOM was then trained using the twelve differ- 
ent ChIP-seq and DNase-seq assays in the six cell types previously 
analysed in the large-scale segmentations (that is, over 72-dimensional 
space). After training, the SOM clustering was again visualized in two 
dimensions, now showing the organized distribution of genome seg- 
ments (lower right of panel, Fig. 7a). Individual data sets associated 
with the genome segments in each SOM map unit (hexagonal cells) 
can then be visualized in the same framework to learn how each 
additional kind of data is distributed on the chromatin state map. 
Figure 7b shows CAGE/TSS expression data overlaid on the randomly 
initialized (left) and trained map (right) panels. In this way the trained 
SOM highlighted cell-type-specific TSS clusters (bottom panels of 
Fig. 7b), indicating that there are sets of tissue-specific TSSs that are 
distinguished from each other by subtle combinations of ENCODE 


Figure 6 | Experimental characterization of segmentations. Randomly 
sampled E state segments (see Table 3) from the K562 segmentation were 
cloned for mouse- and fish-based transgenic enhancer assays. a, Representative 
LacZ-stained transgenic embryonic day (E)11.5 mouse embryo obtained with 
construct hs2065 (EN167, chr10: 46052882-46055670, GRCh37). Highly 
reproducible staining in the blood vessels was observed in 9 out of 9 embryos 
resulting from independent transgenic integration events. b, Representative 
green fluorescent protein reporter transgenic medaka fish obtained from a 
construct with a basal hsp70 promoter on meganuclease-based transfection. 
Reproducible transgenic expression in the circulating nucleated blood cells and 
the endothelial cell walls was seen in 81 out of 100 transgenic tests of this 
construct. 

Figure 7 | High-resolution segmentation of ENCODE data by self-organizing 
maps (SOM). a-c, The training of the SOM (a) and analysis of the results 
(b, c) are shown. Initially we arbitrarily placed genomic segments from the 
ChromHMM segmentation on to the toroidal map surface, although the SOM does 
not use the ChromHMM state assignments (a). We then trained the map using the 
signal of the 12 different ChIP-seq and DNase-seq assays in the six cell types 
analysed. Each unit of the SOM is represented here by a hexagonal cell in a 
planar two-dimensional view of the toroidal map. Curved arrows indicate that 
traversing the edges of the two-dimensional view leads back to the [...] 
distribution of that data within this high-resolution segmentation. In panel a 
the distributions of genome bases across the untrained and trained map (left 
and right, respectively) are shown using heat-map colours for log10 values. 
b, The distribution of TSSs from CAGE experiments of GENCODE annotation on the 
planar representations of either the initial random organization (left) or the 
final trained SOM (right) using heat maps coloured according to the 
accompanying scales. The bottom half of b expands the different distributions 
in the SOM for all expressed TSSs (left) or TSSs specifically expressed in two 
example cell lines, H1 hESC (centre) and HepG2 (right). c, The association of 
Gene Ontology (GO) terms on the same representation of the same trained SOM. 
We assigned genes that are within 20 kb of a genomic segment in a SOM unit to 
that unit, and then associated this set of genes with GO terms using a 
hypergeometric distribution after correcting for multiple testing. Map units 
that are significantly associated to GO terms are coloured green, with 
increasing strength of colour reflecting increasing numbers of genes 
significantly associated with the GO terms for either immune response (left) 
or sequence-specific transcription factor activity (centre). In each case, 
specific SOM units show association with these terms. The right-hand panel 
shows the distribution on the same SOM of all significantly associated GO 
terms, now colouring by GO term count per SOM unit. For sequence-specific 
transcription factor activity, two example genomic regions are extracted at 
the bottom of panel c from neighbouring SOM units. These are regions around 
the DBX1 (from SOM unit 26,31, left panel) and IRX6 (SOM unit 27,30, right 
panel) genes, respectively, along with their H3K27me3 ChIP-seq signal for each 
of the tier 1 and 2 cell types. For DBX1, representative of a set of primarily 
neuronal transcription factors associated with unit 26,31, there is a 
repressive H3K27me3 signal in both H1 hESCs and HUVECs; for IRX6, 
representative of a set of body patterning transcription factors associated 
with SOM unit 27,30, the repressive mark is restricted largely to the 
embryonic stem (ES) cell. An interactive version of this figure is available 
in the online version of the paper. 

chromatin data. Many of the ultra-fine-grained state classifications 
revealed in the SOM are associated with specific gene ontology (GO) 
terms (right panel of Fig. 7c). For instance, the left panel of Fig. 7c 
identifies ten SOM map units enriched with genomic regions 
linked to genes annotated with the GO term ‘immune response’. 
The central panel identifies a different set of map units enriched for the 
GO term ‘sequence-specific transcription factor activity’. The two 
map units most enriched for this GO term, indicated by the darkest 
green colouring, contain genes with segments that are high in 
H3K27me3 in H1 hESCs, but that differ in H3K27me3 levels in 
HUVECs. Gene function analysis with the GO ontology tool 
(GREAT) reveals that the map unit with high H3K27me3 levels in 
both cell types is enriched in transcription factor genes with known 
neuronal functions, whereas the neighbouring map unit is enriched in 
genes involved in body patterning. The genome browser shots at the 
bottom of Fig. 7c pick out an example region for each of the two SOM 
map units illustrating the difference in H3K27me3 signal. Overall, we 
have 228 distinct GO terms associated with specific segments across 
one or more states (A. Mortazavi, personal communication), and can 
assign over one-third of genes to a GO annotation solely on the basis of 
its multicellular histone patterns. Thus, the SOM analysis provides a 
fine-grained map of chromatin data across multiple cell types, which 
can then be used to relate chromatin structure to other data types at 
differing levels of resolution (for instance, the large cluster of units 
containing any active TSS, its subclusters composed of units enriched 
in TSSs active in only one cell type, or individual map units signifi- 
cantly enriched for specific GO terms). 
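
The SOM procedure described above (random initialization on a toroidal map, followed by training on the 12 assays in six cell types, that is, 72 signal values per segment) can be illustrated with a very small pure-Python sketch. It is not the implementation used for the paper: it uses a rectangular rather than hexagonal grid, a 4 x 4 map instead of 1,350 units, three-dimensional toy vectors, and learning-rate and neighbourhood schedules that are assumptions chosen only for illustration.

```python
import math
import random

def toroidal_dist2(a, b, rows, cols):
    """Squared grid distance between units a and b with wrap-around edges."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)
    dc = min(dc, cols - dc)
    return dr * dr + dc * dc

def best_unit(weights, vec):
    """Map unit whose weight vector is closest to vec (squared Euclidean)."""
    return min(weights, key=lambda u: sum((w - x) ** 2 for w, x in zip(weights[u], vec)))

def train_som(data, rows, cols, epochs=20, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial weights, analogous to the randomly initialized map.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        rate = 0.5 * decay                           # learning rate shrinks over time
        radius = 0.5 + max(rows, cols) / 2 * decay   # neighbourhood shrinks too
        for vec in rng.sample(data, len(data)):
            bmu = best_unit(weights, vec)
            for unit, w in weights.items():
                h = math.exp(-toroidal_dist2(unit, bmu, rows, cols) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * h * (vec[i] - w[i])
    return weights

# Hypothetical toy input: four segments with three signal values each.
data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.9, 1.0, 0.9], [1.0, 0.9, 1.0]]
som = train_som(data, rows=4, cols=4)
unit_for_first_segment = best_unit(som, data[0])
```

The wrap-around distance is what makes the map toroidal: units on opposite edges of the planar view are neighbours during training.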

The classifications presented here are necessarily limited by the 
assays and cell lines studied, and probably contain a number of 
heterogeneous classes of elements. Nonetheless, robust classifications 
can be made, allowing a systematic view of the human genome. 
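
The GO-term association applied to SOM units (Fig. 7c) is described as a hypergeometric test followed by multiple-testing correction. A minimal sketch of that calculation is given below. The specific correction is not stated in this section, so the Bonferroni step here is an assumption, as are the function names and the toy counts.

```python
from math import comb

def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, drawn)

def bonferroni(p_values):
    """Assumed correction: multiply each P value by the number of tests, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]

# Hypothetical unit: 3 genes assigned to the unit, all 3 carry a term that
# annotates only 3 of 10 genes overall.
p_unit = hypergeom_upper_tail(3, population=10, annotated=3, drawn=3)
adjusted = bonferroni([0.01, 0.5])
```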


Insights into human genomic variation 


We next explored the potential impact of sequence variation on 
ENCODE functional elements. We examined allele-specific variation 
using results from the GM12878 cells that are derived from an indi- 
vidual (NA12878) sequenced in the 1000 Genomes project, along with 
her parents. Because ENCODE assays are predominantly sequence- 
based, the trio design allows each GM12878 data set to be divided by 
the specific parental contributions at heterozygous sites, producing 
aggregate haplotypic signals from multiple genomic sites. We 
examined 193 ENCODE assays for allele-specific biases using 
1,409,992 phased, heterozygous SNPs and 167,096 insertions/dele- 
tions (indels) (Fig. 8). Alignment biases towards alleles present in 
the reference genome sequence were avoided using a sequence 
specifically tailored to the variants and haplotypes present in 
NA12878 (a ‘personalized genome’). We found instances of preferential 
binding towards each parental allele. For example, comparison of the 
results from the POLR2A, H3K79me2 and H3K27me3 assays in the region of 
NACC2 (Fig. 8a) shows a strong paternal bias for H3K79me2 and POLR2A 
and a strong maternal bias for H3K27me3, indicating differential 
activity for the maternal and paternal alleles. 
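
Allelic bias of the kind shown for NACC2 can be illustrated by comparing the reads assigned to each parental haplotype at phased heterozygous sites. The statistical procedure used in the study is not given in this section, so the exact two-sided binomial test against an even split shown below is an assumption, as are the function names and read counts.

```python
from math import comb

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial P value: total probability of outcomes that are
    no more likely than the observed count k."""
    probs = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(pr for pr in probs if pr <= cutoff))

def allelic_bias(paternal_reads, maternal_reads):
    """Return (paternal fraction, P value under a 50:50 null)."""
    total = paternal_reads + maternal_reads
    return paternal_reads / total, binomial_two_sided(paternal_reads, total)

# Hypothetical counts aggregated over the heterozygous SNPs in one region.
fraction, p_value = allelic_bias(paternal_reads=18, maternal_reads=2)
```

Aggregating reads over several heterozygous sites in a region, as the text describes, increases the counts and therefore the power of such a test.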

Figure 8b shows the correlation of selected allele-specific signals 
across the whole genome. For instance, we found a strong allelic 
correlation between POLR2A and BCLAF1 binding, as well as negative 
correlation between H3K79me2 and H3K27me3, both at genes 
(Fig. 8b, below the diagonal, bottom left) and chromosomal segments 
(top right). Overall, we found that positive allelic correlations among 
the 193 ENCODE assays are stronger and more frequent than nega- 
tive correlations. This may be due to preferential capture of accessible 
alleles and/or the specific histone modification and transcription 
factor assays used in the project. 


Rare variants, individual genomes and somatic variants 


We further investigated the potential functional effects of individual 
variation in the context of ENCODE annotations. We divided 
NA12878 variants into common and rare classes, and partitioned 
these into those overlapping ENCODE annotation (Fig. 9a and 
Supplementary Tables 1 and 2, section K). We also predicted potential 
functional effects: for protein-coding genes, these are either non- 
synonymous SNPs or variants likely to induce loss of function by 
frame-shift, premature stop, or splice-site disruption; for other 
regions, these are variants that overlap a transcription-factor- 
binding site. We found similar numbers of potentially functional 
variants affecting protein-coding genes or affecting other ENCODE 
annotations, indicating that many functional variants within 
individual genomes lie outside exons of protein-coding genes. A more 
detailed analysis of regulatory variant annotation is described in 
ref. 73. 

To study further the potential effects of NA12878 genome variants 
on transcription-factor-binding regions, we performed peak calling 
using a constructed personal diploid genome sequence for NA12878 
(ref. 72). We aligned ChIP-seq sequences from GM12878 separately 
against the maternal and paternal haplotypes. As expected, a greater 

Figure 8 | Allele-specific ENCODE elements. a, Representative allele-specific 
information from GM12878 cells for selected assays around the first exon of the 
NACC2 gene (genomic region Chr9: 138950000-138995000, GRCh37). 
Transcription signal is shown in green, and the three sections show allele- 
specific data for three data sets (POLR2A, H3K79me2 and H3K27me3 ChIP- 
seq). In each case the purple signal is the processed signal for all sequence reads 
for the assay, whereas the blue and red signals show sequence reads specifically 
assigned to either the paternal or maternal copies of the genome, respectively. 
The set of common SNPs from dbSNP, including the phased, heterozygous 
SNPs used to provide the assignment, are shown at the bottom of the panel. 
NACC2 has a statistically significant paternal bias for POLR2A and the 
transcription-associated mark H3K79me2, and has a significant maternal bias 
for the repressive mark H3K27me3. b, Pair-wise correlations of allele-specific 
signal within single genes (below the diagonal) or within individual 
ChromHMM segments across the whole genome for selected DNase-seq and 
histone modification and transcription factor ChIP-seq assays. The extent of 
correlation is coloured according to the heat-map scale indicated from positive 
correlation (red) through to anti-correlation (blue). An interactive version of 
this figure is available in the online version of the paper. 


fraction of reads were aligned than to the reference genome (see 
Supplementary Information, Supplementary Fig. 1, section K). On 
average, approximately 1% of transcription-factor-binding sites in 
GM12878 cells are detected in a haplotype-specific fashion. For 
instance, Fig. 9b shows a CTCF-binding site not detected using the 



[Figure 9 graphic. Panel a: all variants in NA12878 (2,998,908), of which 
86,420 are rare, divided into common and rare classes and by protein-coding 
annotation or ENCODE non-coding annotation, each with the subset having a 
predicted functional effect; other counts shown in the panel are 23,227, 
1,482, 27,940 and 450,129. Panel b: paternal and maternal signal against 
genome position on chromosome 10 (126665000-126671000). Panel c: non-genic 
TSS-distal cell-specific DHS peaks by DHS cell type, marked as enriched, 
depleted (P<0.05) or NA.] 

Figure 9 | Examining ENCODE elements on a per individual basis in the 
normal and cancer genome. a, Breakdown of variants in a single genome 
(NA12878) by both frequency (common or rare (that is, variants not present in 
the low-coverage sequencing of 179 individuals in the pilot 1 European panel of 
the 1000 Genomes project)) and by ENCODE annotation, including 
protein-coding gene and non-coding elements (GENCODE annotations for 
protein-coding genes, pseudogenes and other ncRNAs, as well as 
transcription-factor-binding sites from ChIP-seq data sets, excluding broad 
annotations such as histone modifications, segmentations and RNA-seq). 
Annotation status is further subdivided by predicted functional effect, being 
non-synonymous and missense mutations for protein-coding regions and variants 
overlapping bound transcription factor motifs for non-coding element 
annotations. A substantial proportion of variants are annotated as having 
predicted functional effects in the non-coding category. b, One of several 
relatively rare occurrences, where alignment to an individual genome sequence 
(paternal and maternal panels) shows a different readout from the reference 
genome. In this case, a paternal-haplotype-specific CTCF peak is identified. 
c, Relative level of somatic variants from a whole-genome melanoma sample that 
occur in DHSs unique to different cell lines. The coloured bars show cases 
that are significantly enriched or suppressed in somatic mutations. Details of 
ENCODE cell types can be found at 
http://encodeproject.org/ENCODE/cellTypes.html. An interactive version of this 
figure is available in the online version of the paper. 


reference sequence that is only present on the paternal haplotype 
due to a 1-bp deletion (see also Supplementary Fig. 2, section K). 
As costs of DNA sequencing decrease further, optimized analysis of 
ENCODE-type data should use the genome sequence of the individual 
or cell being analysed when possible. 

Most analyses of cancer genomes so far have focused on characterizing 
somatic variants in protein-coding regions. We intersected four 
available whole-genome cancer data sets with ENCODE annotations 
(Fig. 9c and Supplementary Fig. 2, section L). Overall, somatic variation 
is relatively depleted from ENCODE annotated regions, particularly for 
elements specific to a cell type matching the putative tumour source (for 
example, skin melanocytes for melanoma). Examining the mutational 
spectrum of elements in introns for cases where a strand-specific 
mutation assignment could be made reveals that there are mutational 
spectrum differences between DHSs and unannotated regions (P = 0.06, 
Fisher’s exact test; Supplementary Fig. 3, section L). The suppression 
of somatic mutation is consistent with important functional roles of 
these elements within tumour cells, highlighting a potential alternative 
set of targets for examination in cancer. 


Common variants associated with disease 


In recent years, GWAS have greatly extended our knowledge of 
genetic loci associated with human disease risk and other phenotypes. 
The output of these studies is a series of SNPs (GWAS SNPs) corre- 
lated with a phenotype, although not necessarily the functional 
variants. Notably, 88% of associated SNPs are either intronic or 
intergenic. We examined 4,860 SNP-phenotype associations for 
4,492 SNPs curated in the National Human Genome Research 
Institute (NHGRI) GWAS catalogue. We found that 12% of these 
SNPs overlap transcription-factor-occupied regions whereas 34% over- 
lap DHSs (Fig. 10a). Both figures reflect significant enrichments relative 
to the overall proportions of 1000 Genomes project SNPs (about 6% and 
23%, respectively). Even after accounting for biases introduced by selec- 
tion of SNPs for the standard genotyping arrays, GWAS SNPs show 
consistently higher overlap with ENCODE annotations (Fig. 10a, see 
Supplementary Information). Furthermore, after partitioning the 
genome by density of different classes of functional elements, GWAS 
SNPs were consistently enriched beyond all the genotyping SNPs in 
function-rich partitions, and depleted in function-poor partitions (see 
Supplementary Fig. 1, section M). GWAS SNPs are particularly 
enriched in the segmentation classes associated with enhancers and 
TSSs across several cell types (see Supplementary Fig. 2, section M). 
Examining the SOM of integrated ENCODE annotations (see 
above), we found 19 SOM map units showing significant enrichment 
for GWAS SNPs, including many SOM units previously associated 
with specific gene functions, such as the immune response regions. 
Thus, an appreciable proportion of SNPs identified in initial GWAS 
scans are either functional or lie within the length of an ENCODE 
annotation (~500 bp on average) and represent plausible candidates 
for the functional variant. Expanding the set of feasible functional 
SNPs to those in reasonable linkage disequilibrium, up to 71% of 
GWAS SNPs have a potential causative SNP overlapping a DNase I 
site, and 31% of loci have a candidate SNP that overlaps a binding site 
occupied by a transcription factor (see also refs 73, 75). 
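
The overlap percentages quoted above are fractions of SNPs falling inside annotated intervals, compared against a background SNP set. The sketch below shows one way to compute such a fraction and the resulting fold enrichment for a single chromosome; the function names, coordinates and half-open interval convention are assumptions made for illustration, and the intervals are assumed not to overlap one another.

```python
from bisect import bisect_right

def in_any_interval(position, starts, ends):
    """True if position lies in a half-open [start, end) interval.
    starts/ends must come from sorted, non-overlapping intervals."""
    i = bisect_right(starts, position) - 1
    return i >= 0 and position < ends[i]

def overlap_fraction(positions, intervals):
    """Fraction of SNP positions that fall inside any interval."""
    ordered = sorted(intervals)
    starts = [s for s, _ in ordered]
    ends = [e for _, e in ordered]
    hits = sum(in_any_interval(p, starts, ends) for p in positions)
    return hits / len(positions)

# Hypothetical DHS intervals and SNP positions on one chromosome.
dhs = [(500, 600), (100, 200)]
gwas_snps = [150, 550, 700, 50]
background_snps = [10, 150, 300, 400, 800, 900, 1000, 1200]

gwas_fraction = overlap_fraction(gwas_snps, dhs)
background_fraction = overlap_fraction(background_snps, dhs)
fold_enrichment = gwas_fraction / background_fraction
```

Applied to the figures reported in the text, the same ratio gives roughly a twofold enrichment for transcription-factor-occupied regions (12% against about 6%) and about 1.5-fold for DHSs (34% against about 23%).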

The GWAS catalogue provides a rich functional categorization 
from the precise phenotypes being studied. These phenotypic cate- 
gorizations are nonrandomly associated with ENCODE annotations 
and there is marked correspondence between the phenotype and the 
identity of the cell type or transcription factor used in the ENCODE 
assay (Fig. 10b). For example, five SNPs associated with Crohn’s 
disease overlap GATA2-binding sites (P value 0.003 by random 
permutation or 0.001 by an empirical approach comparing to 
the GWAS-matched SNPs; see Supplementary Information), and 
fourteen are located in DHSs found in immunologically relevant cell 
types. A notable example is a gene desert on chromosome 5p13.1 
containing eight SNPs associated with inflammatory diseases. 
Several are close to or within DHSs in T-helper type 1 (TH1) and 
TH2 cells as well as peaks of binding by transcription factors in 
HUVECs (Fig. 10c). The latter cell line is not immunological, but 
factor occupancy detected there could be a proxy for binding of a 
more relevant factor, such as GATA3, in T cells. Genetic variants in 
this region also affect expression levels of PTGER4 (ref. 76), encoding 
the prostaglandin receptor EP4. Thus, the ENCODE data reinforce 
the hypothesis that genetic variants in 5p13.1 modulate the expression 
of flanking genes, and furthermore provide the specific hypothesis 
that the variants affect occupancy of a GATA factor in an allele- 
specific manner, thereby influencing susceptibility to Crohn’s disease. 
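
The empirical P values quoted above come from comparing an observed overlap count with the counts obtained from control SNP sets. The sketch below illustrates that general idea only. The function names, the interval representation and the toy coordinates are assumptions made for illustration, and the sampling here is unmatched, whereas the analysis in the text matched controls on allele frequency and distance to the nearest TSS.

```python
import random


def overlaps(pos, intervals):
    """Return True if a position falls inside any half-open (start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def count_overlaps(snps, intervals):
    """Count how many SNP positions fall inside at least one interval."""
    return sum(overlaps(p, intervals) for p in snps)


def empirical_p(observed_snps, control_pool, intervals, n_perm=1000, seed=0):
    """One-sided empirical P value from random control sets.

    Draws n_perm control sets of the same size as the observed set and
    reports the fraction whose overlap count is at least the observed
    count, with the usual +1 correction so the estimate is never zero.
    """
    rng = random.Random(seed)
    observed = count_overlaps(observed_snps, intervals)
    k = len(observed_snps)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(control_pool, k)
        if count_overlaps(sample, intervals) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): two annotated intervals, five "associated" SNPs,
# and a pool of control SNP positions on the same coordinate system.
intervals = [(100, 200), (500, 600)]
observed_snps = [120, 150, 510, 550, 900]
control_pool = list(range(0, 10_000, 7))

observed, p_value = empirical_p(observed_snps, control_pool, intervals)
print(observed, p_value)
```

In a real analysis the control pool for each lead SNP would be restricted to SNPs with similar properties, so that the null distribution reflects the ascertainment of the genotyping panel rather than the genome as a whole.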

Nonrandom association of phenotypes with ENCODE cell types 
strengthens the argument that at least some of the GWAS lead SNPs 
are functional or extremely close to functional variants. Each of the 
associations between a lead SNP and an ENCODE annotation 
remains a credible hypothesis of a particular functional element 
class or cell type to explore with future experiments. Supplementary 
Tables 1-3, section M, list all 14,885 pairwise associations across the 
ENCODE annotations. The accompanying papers have a more 
detailed examination of common variants with other regulatory 
information (see, for example, ref. 77). 

Figure 10 | Comparison of genome-wide-association-study-identified loci 
with ENCODE data. a, Overlap of lead SNPs in the NHGRI GWAS SNP 
catalogue (June 2011) with DHSs (left) or transcription-factor-binding sites 
(right) as red bars compared with various control SNP sets in blue. The control 
SNP sets are (from left to right): SNPs on the Illumina 2.5M chip as an example 
of a widely used GWAS SNP typing panel; SNPs from the 1000 Genomes 
project; SNPs extracted from 24 personal genomes (see personal genome 
variants track at http://main.genome-browser.bx.psu.edu (ref. 80)), all shown 
as blue bars. In addition, a further control used 1,000 randomizations from the 
genotyping SNP panel, matching the SNPs with each NHGRI catalogue SNP 
for allele frequency and distance to the nearest TSS (light blue bars with bounds 
at 1.5 times the interquartile range). For both DHSs and transcription-factor- 
binding regions, a larger proportion of overlaps with GWAS-implicated SNPs 
is found compared to any of the control sets. b, Aggregate overlap of 
phenotypes to selected transcription-factor-binding sites (left matrix) or DHSs 
in selected cell lines (right matrix), with a count of overlaps between the 
phenotype and the cell line/factor. Values in blue squares pass an empirical 
P-value threshold ≤0.01 (based on the same analysis of overlaps between 
randomly chosen, GWAS-matched SNPs and these epigenetic features) and 
have at least a count of three overlaps. The P value for the total number of 
phenotype-transcription factor associations is <0.001. c, Several SNPs 
associated with Crohn’s disease and other inflammatory diseases that reside in a 
large gene desert on chromosome 5, along with some epigenetic features 
indicative of function. The SNP (rs11742570) strongly associated with Crohn’s 
disease overlaps a GATA2 transcription-factor-binding signal determined in 
HUVECs. This region is also DNase I hypersensitive in HUVECs and T-helper 
TH1 and TH2 cells. An interactive version of this figure is available in the online 
version of the paper. 


Concluding remarks 


The unprecedented number of functional elements identified in this 
study provides a valuable resource to the scientific community and 
significantly enhances our understanding of the human genome. 
Our analyses have revealed many novel aspects of gene expression and 
regulation as well as the organization of such information, as 
illustrated by the accompanying papers (see 
http://www.encodeproject.org/ENCODE/pubs.html for collected ENCODE publications). 
However, there are still many specific details, particularly about the 
mechanistic processes that generate these elements and how and 
where they function, that require additional experiments to elucidate. 

The large spread of coverage—from our highest resolution, most 
conservative set of bases implicated in GENCODE protein-coding 
gene exons (2.9%) or specific protein DNA binding (8.5%) to the 
broadest, most general set of marks covering the genome (approxi- 
mately 80%), with many gradations in between—presents a spectrum 
of elements with different functional properties discovered by 
ENCODE. A total of 99% of the known bases in the genome are within 
1.7 kb of any ENCODE element, whereas 95% of bases are within 8 kb 
of a bound transcription factor motif or DNase I footprint. 
Interestingly, even using the most conservative estimates, the fraction 
of bases likely to be involved in direct gene regulation, even though 
incomplete, is significantly higher than that ascribed to protein- 
coding exons (1.2%), raising the possibility that more information 
in the human genome may be important for gene regulation than 
for biochemical function. Many of the regulatory elements are not 
constrained across mammalian evolution, which so far has been one 
of the most reliable indications of an important biochemical event 
for the organism. Thus, our data provide orthologous indicators for 
suggesting possible functional elements. 

Importantly, for the first time we have sufficient statistical power to 
assess the impact of negative selection on primate-specific elements, 
and all ENCODE classes display evidence of negative selection in these 
unique-to-primate elements. Furthermore, even with our most conser- 
vative estimate of functional elements (8.5% of putative DNA/protein 
binding regions) and assuming that we have already sampled half of the 
elements from our transcription factor and cell-type diversity, one 
would estimate that at a minimum 20% (17% from protein binding 
and 2.9% protein coding gene exons) of the genome participates in these 
specific functions, with the likely figure significantly higher. 
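
The arithmetic behind the minimum estimate above can be written out explicitly. This is only a restatement of the figures in the text under its stated assumption that half of the protein-binding elements have been sampled. The function name is illustrative, and the simple addition (which ignores any overlap between binding regions and exons) follows the wording of the paragraph rather than any published calculation.

```python
def minimum_functional_percent(binding_pct, fraction_sampled, exon_pct):
    """Project observed binding coverage to full sampling, then add exon coverage.

    binding_pct      -- percent of the genome in observed protein-binding regions
    fraction_sampled -- assumed fraction of binding elements sampled so far
    exon_pct         -- percent of the genome in protein-coding gene exons
    """
    projected_binding = binding_pct / fraction_sampled
    return projected_binding, projected_binding + exon_pct


# Figures from the text: 8.5% observed binding, half sampled, 2.9% exons.
projected, total = minimum_functional_percent(8.5, 0.5, 2.9)
print(projected, round(total, 1))
```

The projected binding coverage is 17% and the sum is 19.9%, which the text rounds to "at a minimum 20%".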

The broad coverage of ENCODE annotations enhances our under- 
standing of common diseases with a genetic component, rare genetic 
diseases, and cancer, as shown by our ability to link otherwise 
anonymous associations to a functional element. ENCODE and 
similar studies provide a first step towards interpreting the rest of 
the genome—beyond protein-coding genes—thereby augmenting 
common disease genetic studies with testable hypotheses. Such 
information justifies performing whole-genome sequencing (rather 
than exome only, 1.2% of the genome) on rare diseases and investi- 
gating somatic variants in non-coding functional elements, for 
instance, in cancer. Furthermore, as GWAS analyses typically asso- 
ciate disease to SNPs in large regions, comparison to ENCODE non- 
coding functional elements can help pinpoint putative causal variants 
in addition to refinement of location by fine-mapping techniques. 
Combining ENCODE data with allele-specific information derived 
from individual genome sequences provides specific insight on the 
impact of a genetic variant. Indeed, we believe that a significant goal 
would be to use functional data such as that derived from this project 
to assign every genomic variant to its possible impact on human 
phenotypes. 

So far, ENCODE has sampled 119 of 1,800 known transcription factors 
and general components of the transcriptional machinery on a 
limited number of cell types, and 13 of more than 60 currently known 
histone or DNA modifications across 147 cell types. DNase I, FAIRE and 
extensive RNA assays across subcellular fractionations have been under- 
taken on many cell types, but overall these data reflect a minor fraction of 
the potential functional information encoded in the human genome. An 
important future goal will be to enlarge this data set to additional factors, 
modifications and cell types, complementing the other related projects 
in this area (for example, Roadmap Epigenomics Project, http:// 
www.roadmapepigenomics.org/, and International Human Epigenome 
Consortium, http://www.ihec-epigenomes.org/). These projects will 
constitute foundational resources for human genomics, allowing a 
deeper interpretation of the organization of gene and regulatory 
information and the mechanisms of regulation, and thereby provide 
important insights into human health and disease. Co-published 
ENCODE-related papers can be explored online via the Nature 
ENCODE explorer (http://www.nature.com/ENCODE), a specially 
designed visualization tool that allows users to access the linked papers 
and investigate topics that are discussed in multiple papers via them- 
atically organized threads. 


METHODS SUMMARY 


For full details of Methods, see Supplementary Information. 

DNase I hypersensitive sites (DHSs) are markers of regulatory DNA and have underpinned the discovery of all classes of 
cis-regulatory elements including enhancers, promoters, insulators, silencers and locus control regions. Here we present 
the first extensive map of human DHSs identified through genome-wide profiling in 125 diverse cell and tissue types. We 
identify ~2.9 million DHSs that encompass virtually all known experimentally validated cis-regulatory sequences and 
expose a vast trove of novel elements, most with highly cell-selective regulation. Annotating these elements using 
ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation and 
regulatory factor occupancy patterns. We connect ~580,000 distal DHSs with their target promoters, revealing 
systematic pairing of different classes of distal DHSs and specific promoter types. Patterning of chromatin accessibility 
at many regulatory regions is organized with dozens to hundreds of co-activated elements, and the transcellular DNase I 
sensitivity pattern at a given region can predict cell-type-specific functional behaviours. The DHS landscape shows 
signatures of recent functional evolutionary constraint. However, the DHS compartment in pluripotent and 
immortalized cells exhibits higher mutation rates than that in highly differentiated cells, exposing an unexpected link 
between chromatin accessibility, proliferative potential and patterns of human variation. 


Cell-selective activation of regulatory DNA drives the gene expression 
patterns that shape cell identity. Regulatory DNA is characterized by 
the cooperative binding of sequence-specific transcriptional regulatory 
factors in place of a canonical nucleosome, leading to a remodelled 
chromatin state characterized by markedly heightened accessibility to 
nucleases. DNase I hypersensitive sites (DHSs) in chromatin were first 
identified over 30 years ago, and have since been used extensively to 
map regulatory DNA regions in diverse organisms. DNase I 
hypersensitivity is central to all defined classes of active 
cis-regulatory elements including enhancers, promoters, silencers, 
insulators and locus control regions. Because DNase I hypersensitivity 
overlies cis-regulatory elements directly and is maximal over the core 
region of regulatory factor occupancy, it enables precise delineation 
of the genomic cis-regulatory compartment. DHSs are flanked by 
nucleosomes, which may acquire histone modification patterns that 
reflect the functional role of the adjoining regulatory DNA, such as the 
association of histone H3 lysine 4 trimethylation (H3K4me3) with 
promoter elements. Recent advances have enabled genome-scale mapping 
of DHSs in mammalian cells, laying the foundations for comprehensive 
catalogues of human regulatory DNA. 


General features of the accessible chromatin landscape 


Figure 1 | General features of the DHS landscape. a, Density of DNase I 
cleavage sites for selected cell types, shown for an example ~350-kb region. Two 
regions are shown to the right in greater detail. b, Left: distribution of 2,890,742 
DHSs with respect to GENCODE gene annotations. Promoter DHSs are defined 
as the first DHS localizing within 1 kb upstream of a GENCODE TSS. Right: 
distribution of intergenic DHSs relative to GENCODE TSSs. c, Distributions of the 
number of cell types, from 1 to 125 (y axis), in which DHSs in each of four classes 
(x axis) are observed. Width of each shape at a given y value shows the relative 
frequency of DHSs present in that number of cell types. 

Two ENCODE production centres (University of Washington and 
Duke University) profiled DNase I sensitivity genome-wide using 
massively parallel sequencing in a total of 125 human cell and 
tissue types including normal differentiated primary cells (n = 71), 
immortalized primary cells (n = 16), malignancy-derived cell lines 
(n = 30) and multipotent and pluripotent progenitor cells (n = 8) 
(Supplementary Table 1). The density of mapped DNase I cleavages 
as a function of genome position provides a continuous quantitative 
measure of chromatin accessibility, in which DHSs appear as 
prominent peaks within the signal data from each cell type (Fig. 1a 
and Supplementary Figs 1 and 2). Analysis using a common algorithm 
(see Methods) identified 2,890,742 distinct high-confidence DHSs 
(false discovery rate (FDR) of 1%; see Methods), each of which was 
active in one or more cell types. Of these DHSs, 970,100 were specific 
to a single cell type, 1,920,642 were active in 2 or more cell types, and a 
small minority (3,692) was detected in all cell types. The relative 
accessibility of DHSs along the genome varies by >100-fold and is 
highly consistent across cell types (Supplementary Figs 1 and 2). To 
estimate the sensitivity and accuracy of the sequencing-derived DHS 
maps, one ENCODE production centre (University of Washington) 
performed 7,478 classical DNase I hypersensitivity experiments by 
the Southern hybridization method. Using Southern blots as the 
standard, the average sensitivity, per cell type, of DNase I-seq (at a 
sequencing depth of 30 M uniquely mapping reads) was 81.6%, with 
specificity of 99.5-99.9%. Of DHSs classified as false negatives within 
a particular cell type, an average of 92.4% were detected as a DHS in 
another cell type or upon deeper sequencing. As such, we estimate 
that the overall sensitivity for DHSs of the combined cell type maps 
is >98%. 
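
One way to see how the per-cell-type figures lead to the combined estimate of >98% is to add back the false negatives that are recovered elsewhere. The calculation below is a reading of the numbers in the text, not the authors' stated procedure, and the function name is illustrative.

```python
def combined_sensitivity(per_cell_type, recovered_fraction):
    """Sensitivity after recovering a fraction of the per-cell-type misses.

    per_cell_type      -- average sensitivity within a single cell type
    recovered_fraction -- share of false negatives detected as a DHS in
                          another cell type or upon deeper sequencing
    """
    missed = 1.0 - per_cell_type
    return per_cell_type + missed * recovered_fraction


# Figures from the text: 81.6% per-cell-type sensitivity, 92.4% of misses recovered.
estimate = combined_sensitivity(0.816, 0.924)
print(f"{estimate:.3f}")
```

Under this reading the combined sensitivity is about 0.986, consistent with the >98% quoted.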

Approximately 3% (n = 75,575) of DHSs localize to transcriptional 
start sites (TSSs) defined by GENCODE and 5% (n = 135,735, 
including the aforementioned) lie within 2.5 kilobases (kb) of a TSS. 
The remaining 95% of DHSs are positioned more distally, and are 
roughly evenly divided between intronic and intergenic regions 
(Fig. 1b). Promoters typically exhibit high accessibility across cell types, 
with the average promoter DHS detected in 29 cell types (Fig. 1c, 
second column). By contrast, distal DHSs are largely cell selective 
(Fig. 1c, third column). 

MicroRNAs (miRNAs) comprise a major class of regulatory 
molecules and have been extensively studied, resulting in consensus 
annotation of hundreds of conserved miRNA genes, approximately 
one-third of which are organized in polycistronic clusters. However, 
most predicted promoters driving microRNA expression lack 
experimental evidence. Of 329 unique annotated miRNA TSSs 
(Supplementary Methods), 300 (91%) either coincided with or closely 
approximated (<500 base pairs (bp)) a DHS. Chromatin accessibility 
at miRNA promoters was highly promiscuous compared with 
GENCODE TSSs (Fig. 1c, fourth column), and showed cell lineage 
organization, paralleling the known regulatory roles of well-annotated 
lineage-specific miRNAs (Supplementary Fig. 3). 

The 20-50-bp read lengths from DNase I-seq experiments enabled 
unique mapping to 86.9% of the genomic sequence, allowing us to 
interrogate a large fraction of transposon sequences. A surprising 
number contain highly regulated DHSs (Fig. 1c, fifth column and 
Supplementary Figs 4 and 5), compatible with cell-specific transcription 
of repetitive elements detected using ENCODE RNA sequencing 
data. DHSs were most strongly enriched at long terminal repeat (LTR) 
elements, which encode retroviral enhancer structures (Supplemen- 
tary Table 2). Two such examples are shown in Supplementary Fig. 4, 
which also illustrates the strong cell-selectivity of chromatin accessibility 
seen for each major repeat class. We also documented numerous 
examples of transposon DHSs that displayed enhancer activity in tran- 
sient transfection assays (Supplementary Table 3). 

Comparison with an extensive compilation of 1,046 experimentally 
validated distal, non-promoter cis-regulatory elements (enhancers, 
insulators, locus control regions, and so on) revealed the overwhelm- 
ing majority (97.4%) to be encompassed within DNase I hypersensi- 
tive chromatin (Supplementary Table 4), typically with strong cell 
selectivity (Supplementary Fig. 2b). 


Transcription factor drivers of chromatin accessibility 


Figure 2 | Transcription factor drivers of chromatin accessibility. a, DNase I 
tag density is shown in red for a 175-kb region of chromosome 19. Below: 
normalized ChIP-seq tag density for 45 ENCODE ChIP-seq experiments from 
K562 cells, with a cumulative sum of the individual tag density tracks shown 
immediately below the K562 DNase I data. b, Genome-wide correlation 
(r = 0.7943) between ChIP-seq and DNase I tag densities (log10) in K562 cells. 
c, Left: 94.4% of a combined 1,108,081 ChIP-seq peaks from all transcription 
factors assayed in K562 cells fall within accessible chromatin (grey areas of pie 
chart). Top: three examples of transcription factors localizing almost 
exclusively within accessible chromatin. Bottom: three transcription factors 
from the KRAB-associated complex localizing partially or predominantly 
within inaccessible chromatin. 

DNase I hypersensitive sites result from cooperative binding of 
transcriptional factors in place of a canonical nucleosome. To quantify 
the relationship between chromatin accessibility and the occupancy of 
regulatory factors, we compared sequencing-depth-normalized 
DNase I sensitivity in the ENCODE common cell line K562 to normalized 
chromatin immunoprecipitation and high-throughput sequencing 
(ChIP-seq) signals from all 42 transcription factors mapped by 
ENCODE ChIP-seq in this cell type (Fig. 2). Simple summation of 
the ChIP-seq signals markedly parallels quantitative DNase I sensitivity 
at individual DHSs (Fig. 2a) and across the genome (r = 0.79, Fig. 2b). 
For example, the β-globin locus control region contains a major 
enhancer element at hypersensitive site 2 (HS2), which appears to be 
occupied by dozens of transcription factors (Supplementary Fig. 6a). 
Such highly overlapping binding patterns have been interpreted to 
signify weak interactions with lower-affinity recognition sequences 
potentiated by an accessible DNA template. However, HS2 is a compact 
element with a functional core spanning ~110 bp that contains 
5-8 sites of transcription factor-DNA interaction in vivo depending on 
the cell type. The fact that the cumulative ChIP-seq signal closely 
parallels the degree of nuclease sensitivity at HS2 and elsewhere is thus 
most readily explained by interactions between DNA-bound factors 
and other interacting factors that collectively potentiate the accessible 
chromatin state (Supplementary Fig. 6b). Given the relatively limited 
number of factors studied, it may seem surprising that such a close 
correlation should be evident. However, most of the factors selected 
for ENCODE ChIP-seq studies have well-described or even 
fundamental roles in transcriptional regulation, and many were identified 
originally based on their high affinity for DNA. Alternatively, as 
originally proposed in ref. 19, a limited number of factors may be involved 
in establishment and maintenance of chromatin remodelling, whereas 
others may interact nonspecifically with the remodelled state. We also 
found that the recognition sequences for a small number of factors were 
consistently linked with elevated chromatin accessibility across 
all classes of sites and all cell types (Supplementary Fig. 6c), indicating 
that regulators acting through these sequences are key drivers of the 
accessibility landscape. 
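
The comparison described in this section, in which ChIP-seq signals are summed across factors and correlated with DNase I signal on a log10 scale, can be sketched as follows. This is a minimal illustration with made-up numbers. The per-window layout of the inputs, the pseudocount and the function names are assumptions, and the normalization used in the paper is described only in its Methods.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)


def summed_chip_vs_dnase(chip_tracks, dnase, pseudocount=1.0):
    """Correlate log10 summed ChIP-seq signal with log10 DNase I signal.

    chip_tracks -- one list of per-window tag densities per factor
    dnase       -- per-window DNase I tag densities for the same windows
    pseudocount -- added before taking logs so empty windows are defined
    """
    summed = [sum(values) for values in zip(*chip_tracks)]
    log_chip = [math.log10(s + pseudocount) for s in summed]
    log_dnase = [math.log10(d + pseudocount) for d in dnase]
    return pearson(log_chip, log_dnase)


# Toy data (hypothetical): three factors measured over four genomic windows.
chip_tracks = [
    [0, 5, 20, 1],
    [1, 10, 30, 0],
    [0, 2, 50, 3],
]
dnase = [2, 40, 300, 8]

r = summed_chip_vs_dnase(chip_tracks, dnase)
print(round(r, 3))
```

Summing before taking logs mirrors the "simple summation" described in the text; taking logs per factor and then summing would weight weakly bound factors more heavily and is a different statistic.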

Overall, 94.4% of a combined 1,108,081 ChIP-seq peaks from all 
ENCODE transcription factors fall within accessible chromatin 
(Fig. 2c and Supplementary Fig. 7a), with the median factor having 
98.2% of its binding sites localized therein. Notably, a small number 
of factors diverged from this paradigm, including known chromatin 
repressors, such as the KRAB-associated factors KAP1 (also called 
TRIM28), SETDB1 and ZNF274 (refs 20, 21) (Fig. 2c). We hypothesized 
that a proportion of the occupancy sites of these factors represented 
binding within compacted heterochromatin. To test this, we developed 
targeted mass spectrometry assays for KAP1 and three factors 
localizing almost exclusively within accessible chromatin (GATA1, 
c-Jun, NRF1), and quantified their abundance in biochemically 
defined heterochromatin against a total chromatin fraction 
(Supplementary Fig. 7b). This analysis confirmed that factors such as 
KAP1 show a significant level of heterochromatin occupancy 
(Supplementary Fig. 7c). 


An invariant directional promoter chromatin signature 


The annotation of sites of transcription origination continues to be an 
active and fundamental endeavour. In addition to direct evidence of 
TSSs provided by RNA transcripts, H3K4me3 modifications are 
closely linked with TSSs. We therefore explored systematically the 
relationship between chromatin accessibility and H3K4me3 patterns 
at well-annotated promoters, its relationship to transcription origina- 
tion, and its variability across ENCODE cell types. 

We performed ChIP-seq for H3K4me3 in 56 cell types using the same 
biological samples used for DNaseI data (Supplementary Table 1, 
column D). Plotting DNase I cleavage density against ChIP-seq tag den- 
sity around TSSs reveals highly stereotyped, asymmetrical patterning of 
these chromatin features with a precise relationship to the TSS (Fig. 
3a, b). This directional pattern is consistent with a rigidly positioned 
nucleosome immediately downstream from the promoter DHS, and is 
largely invariant across cell types (Fig. 3b and Supplementary Fig. 8). 

To map novel promoters (and their directionality) not en- 
compassed by the GENCODE consensus annotations, we applied a 
pattern-matching approach to scan the genome across all 56 cell types 
(Supplementary Methods). Using this approach we identified a total 
of 113,622 distinct putative promoters. Of these, 68,769 correspond to 
previously annotated TSSs, and 44,853 represent novel predictions 
(versus GENCODE v7). Of the novel sites, 99.5% are supported by 
evidence from spliced expressed sequence tags (ESTs) and/or cap ana- 
lysis of gene expression (CAGE) tag clusters (Fig. 3c and 
Supplementary Fig. 9, P < 0.0001; see Supplementary Methods). We 
found novel sites in every configuration relative to existing annotations 
(Fig. 3d-f and Supplementary Fig. 10). For example, 29,203 putative 
promoters are contained in the bodies of annotated genes, of which 
17,214 are oriented antisense to the annotated direction of transcrip- 
tion, and 2,794 lie immediately downstream of an annotated gene’s 
3’ end, with 1,638 in antisense orientation. The results indicate that 
chromatin data can systematically inform RNA transcription analyses, 
and suggest the existence of a large pool of cell-selective transcriptional 
promoters, many of which lie in antisense orientations. 


Chromatin accessibility and DNA methylation patterns 


CpG methylation has been closely linked with gene regulation, based 
chiefly on its association with transcriptional silencing. However, 
the relationship between DNA methylation and chromatin structure 
has not been clearly defined. We analysed ENCODE reduced- 
representation bisulphite sequencing (RRBS) data, which provide 
quantitative methylation measurements for several million CpGs 
(K. E. Varley et al., manuscript submitted; see Gene Expression 
Omnibus accession GSE27584). We focused on 243,037 CpGs falling 
within DHSs in 19 cell types for which both data types were available 
from the same sample. We observed two broad classes of sites: those 
with a strong inverse correlation across cell types between DNA 
methylation and chromatin accessibility (Fig. 4a and Supplementary 
Fig. 11a), and those with variable chromatin accessibility but 
constitutive hypomethylation (Fig. 4a, right). To quantify these trends 
globally, we performed a linear regression analysis between chromatin 
accessibility and DNA methylation at the 34,376 CpG-containing 
DHSs (see Supplementary Methods). Of these sites, 6,987 (20%) 
showed a significant association (1% FDR) between methylation 
and accessibility (Supplementary Fig. 11b). Increased methyla- 
tion was almost uniformly negatively associated with chromatin 
accessibility (>97% of cases). The magnitude of the association 
between methylation and accessibility was strong, with the latter on 
average 95% lower in cell types with coinciding methylation versus 
cell types lacking coinciding methylation (Supplementary Fig. 11c). 
Fully 40% of variable methylation was associated with a concomitant 
effect on accessibility. 

The role of DNA methylation in causation of gene silencing is 
presently unclear. Does methylation reduce chromatin accessibility 
by evicting transcription factors? Or does DNA methylation passively 
‘fill in’ the voids left by vacating transcription factors? Transcription 
factor expression is closely linked with the occupancy of its binding 
sites. If the former of the two above hypotheses is correct, methylation 
of individual binding site sequences should be independent of 
transcription factor gene expression. If the latter, methylation at 
transcription factor recognition sequences should be negatively correlated 
with transcription factor abundance (Fig. 4b). 

Figure 3 | Identification and directional classification of novel promoters. 
a, DNase I (blue) and H3K4me3 (red) tag densities for K562 cells around 
annotated TSS of ACTR3B. b, Averaged H3K4me3 tag density (red, right y axis) 
and log DNase I tag density (blue, left y axis) across 10,000 randomly selected 
GENCODE TSSs, oriented 5′→3′. Each blue and red curve is for a different cell 
type, showing invariance of the pattern. c, Relation of 113,615 promoter 
predictions to GENCODE annotations, with supporting EST and CAGE 
evidence (bar at right). d-f, Examples of novel promoters identified in K562; 
red arrow marks predicted TSS and direction of transcription, with CAGE tag 
clusters, spliced ESTs and GENCODE annotations above. d, Novel TSS 
confirmed by CAGE and ESTs. e, Novel TSS confirmed by CAGE, no ESTs. 
Note intronic location. f, Antisense prediction within annotated gene. 


Comparing transcription factor transcript levels to average 
methylation at cognate recognition sites within DHSs revealed sig- 
nificant negative correlations between transcription factor expression 
and binding site methylation for most (70%) transcription factors 
with a significant association (P < 0.05). Representative examples 
are shown in Fig. 4c and Supplementary Fig. 12a. These data argue 
strongly that methylation patterning paralleling cell-selective chro- 
matin accessibility results from passive deposition after the vacation 
of transcription factors from regulatory DNA, confirming and 
extending other recent reports. 
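
The test described above reduces to asking, for each transcription factor, whether its transcript level and the average methylation at its recognition sites move in opposite directions across cell types. A minimal sketch follows. The factor labels and values are hypothetical, Pearson correlation is assumed because the text does not name the statistic, and significance testing is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)


def classify_factors(expression, methylation):
    """Label each factor by the sign of its expression-methylation correlation.

    expression  -- dict: factor -> transcript level per cell type
    methylation -- dict: factor -> mean methylation at its sites per cell type
    Both use the same cell-type order.
    """
    labels = {}
    for factor, levels in expression.items():
        r = pearson(levels, methylation[factor])
        labels[factor] = "negative" if r < 0 else "positive"
    return labels


# Toy data (hypothetical factors across four cell types).
expression = {"TF_A": [10, 8, 2, 1], "TF_B": [1, 2, 8, 9]}
methylation = {"TF_A": [0.1, 0.2, 0.8, 0.9], "TF_B": [0.2, 0.3, 0.7, 0.9]}

labels = classify_factors(expression, methylation)
fraction_negative = sum(v == "negative" for v in labels.values()) / len(labels)
print(labels, fraction_negative)
```

A negative label corresponds to the passive model, in which methylation accumulates where a factor is absent; a positive label corresponds to the exceptions discussed next.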

Interestingly, a small number of factors showed positive correla- 
tions between expression and binding site methylation (Supplemen- 
tary Fig. 12b), including MYB and LUN-1 (also known as TOPORS). 
Both of these transcription factors showed increased transcription 
and binding site methylation specifically within acute promyelocytic 
leukaemia cells (NB4), and both interact with promyelocytic leukaemia 
(PML) bodies, a sub-nuclear structure disrupted in PML cells. The 
anomalous behaviour of these two transcription factors with respect to 
chromatin structure and DNA methylation may thus be related to a 
specialized mechanism seen only in pathologically altered cells. 

Figure 4 | Chromatin accessibility and DNA methylation patterns. 
a, DNase I sensitivity in 10 cell types with ENCODE reduced representation 
bisulphite sequencing data. Inset box: accessibility (y axis) decreases 
quantitatively as methylation increases. Other DHSs (right) show low 
correlation between accessibility and methylation. CpG methylation scale: 
green, 0%; yellow, 50%; red, 100%. b, Model of transcription factor (TF)-driven 
methylation patterns in which methylation passively mirrors transcription 
factor occupancy. c, Relationship between transcription factor transcript levels 
and overall methylation at cognate recognition sequences of the same 
transcription factors. Lymphoid regulators in B-lymphoblastoid line GM06990 
(left) and erythroid regulators in the erythroleukaemia line K562 (right). 
Negative correlation indicates that site-specific DNA methylation follows 
transcription factor vacation of differentially expressed transcription factors. 


A map of distal DHS-to-promoter connections 


From examination of DNase I profiles across many cell types we 
observed that many known cell-selective enhancers become DHSs 
synchronously with the appearance of hypersensitivity at the pro- 
moter of their target gene (Supplementary Fig. 13). To generalize this, 
we analysed the patterning of 1,454,901 distal DHSs (DHSs separated 
from a TSS by at least one other DHS) across 79 diverse cell types 
(Supplementary Methods and Supplementary Table 6), and correlated 
the cross-cell-type DNase I signal at each DHS position with 
that at all promoters within ±500 kb (Supplementary Fig. 14a). We 
identified a total of 578,905 DHSs that were highly correlated (r > 0.7) 
with at least one promoter (P < 10^−100), providing an extensive map 
of candidate enhancers controlling specific genes (Supplementary 
Methods and Supplementary Table 7). To validate the distal DHS/ 
enhancer-promoter connections, we profiled chromatin interactions 
using the chromosome conformation capture carbon copy (5C) 
technique. For example, the phenylalanine hydroxylase (PAH) gene is 
expressed in hepatic cells, and an enhancer has been defined upstream 
of its TSS (Fig. 5a). The correlation values for three DHSs within the 
gene body closely parallel the frequency of long-range chromatin 
interactions measured by 5C. The three interacting intronic DHSs 
cloned downstream of a reporter gene driven by the PAH promoter 
all showed increased expression ranging from three- to tenfold over a 
promoter-only control, confirming enhancer function. 
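
The pairing procedure described above (correlate each distal DHS with every promoter within 500 kb across cell types, and keep pairs with r > 0.7) can be sketched in a few lines. The record layout, the function names and the toy signal vectors are assumptions for illustration. The window and threshold come from the text, and the significance calculation is not reproduced.

```python
import math

WINDOW_BP = 500_000   # promoters within +/-500 kb of the distal DHS
R_MIN = 0.7           # correlation threshold quoted in the text


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)


def link_dhs_to_promoters(distal, promoters, window=WINDOW_BP, r_min=R_MIN):
    """Return (dhs, promoter, r) for every nearby, highly correlated pair.

    distal, promoters -- lists of (name, chrom, position, signal) where
    signal holds the DNase I signal in each cell type, in a shared order.
    """
    links = []
    for d_name, d_chrom, d_pos, d_signal in distal:
        for p_name, p_chrom, p_pos, p_signal in promoters:
            if d_chrom != p_chrom or abs(d_pos - p_pos) > window:
                continue
            r = pearson(d_signal, p_signal)
            if r > r_min:
                links.append((d_name, p_name, r))
    return links


# Toy data (hypothetical): four cell types, two promoters, two distal DHSs.
promoters = [
    ("promA", "chr1", 1_000_000, [1, 5, 9, 2]),
    ("promB", "chr1", 3_000_000, [1, 5, 9, 2]),   # same pattern, but too far away
]
distal = [
    ("dhs1", "chr1", 1_200_000, [2, 6, 10, 3]),   # tracks promA across cell types
    ("dhs2", "chr1", 900_000, [9, 2, 1, 8]),      # anti-correlated with promA
]

links = link_dhs_to_promoters(distal, promoters)
print([(d, p) for d, p, _ in links])
```

The nested loop is quadratic and is adequate only for illustration; at the scale of the text (about 1.45 million distal DHSs) promoters would need to be indexed by position so that each DHS is compared only with its neighbours.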

We next examined comprehensive promoter-versus-all 5C experi- 
ments performed over 1% of the human genome”? in K562 cells. 
DHS-promoter pairings were markedly enriched in the specific cog- 
nate chromatin interaction (P< 10 '*, Supplementary Fig. 14b). We 
also examined K562 promoter-DHS interactions detected by 
polymerase II chromatin interaction analysis with paired-end tag 
sequencing (ChIA-PET)™, which quantifies interactions between pro- 
moter-bound polymerase and distal sites. The ChIA-PET interactions 
were also markedly enriched for DHS-promoter pairings (P< 10°, 
Supplementary Fig. 14c). Together, the large-scale interaction analyses 
affirm the fidelity of DHS-promoter pairings based on correlated 
DNase I sensitivity signals at distal and promoter DHSs. 

Most promoters were assigned to more than one distal DHS, 
indicating the existence of combinatorial distal regulatory inputs for 
most genes (Fig. 5b and Supplementary Table 7). A similar result is 
forthcoming from large-scale 5C interaction data*’. Surprisingly, 
roughly half of the promoter-paired distal DHSs were assigned to 
more than one promoter (Fig. 5b and Supplementary Methods), indi- 
cating that human cis-regulatory circuitry is significantly more com- 
plicated than previously anticipated, and may serve to reinforce the 
robustness of cellular transcriptional programs. 

The number of distal DHSs connected with a particular promoter 
provides, for the first time, a quantitative measure of the overall 
regulatory complexity of that gene. We asked whether there are any 
systematic functional features of genes with highly complex regulation. 
We ranked all human genes by the number of distal DHSs paired with 
the promoter of each gene, then performed a Gene Ontology analysis 
on the rank-ordered list. We found that the most complexly regulated 
human genes were markedly enriched in immune system functions 
(Supplementary Fig. 14d), indicating that the complexity of cellular 
and environmental signals processed by the immune system is directly 
encoded in the cis-regulatory architecture of its constituent genes. 

Next, we asked whether DHS-promoter pairings reflected 
systematic relationships between specific combinations of regulatory 
factors (Supplementary Methods). For example, KLF4, SOX2, OCT4 
Figure 5 | A genome-wide map of distal DHS-to-promoter connectivity. 

a, Cross-cell-type correlation (red arcs, left y axis) of distal DHSs and PAH 
promoter closely parallels chromatin interactions measured by 5C-seq (blue 
arcs, right y axis); black bars indicate HindIII fragments used in 5C assays. 
Known (green) and novel (magenta) enhancers confirmed in transfection 
assays are shown below. Enhancer at far right is not separable by 5C as it lies 
within the HindIII fragment containing the promoter. b, Left: proportions of 
69,965 promoters correlated (r > 0.7) with 0 to >20 DHSs within 500 kb. Right: 
proportions of 578,905 non-promoter DHSs (out of 1,454,901) correlated with 
1 to >3 promoters within 500 kb. c, Pairing of canonical promoter motif 
families with specific motifs in distal DHSs. 


(also called POU5F1) and NANOG are known to form a well- 
characterized transcriptional network controlling the pluripotent 
state of embryonic stem cells**. We found significant enrichment 
(P <0.05) of the KLF4, SOX2 and OCT4 motifs within distal DHSs 
correlated with promoter DHSs containing the NANOG motif; 
enrichment of NANOG, SOX2 and OCT4 distal motifs co-occurring 
with promoter motif OCT4; and enrichment of distal SOX2 and 
OCT4 motifs with promoter SOX2 motifs (Supplementary Fig. 
15a). By contrast, promoters containing KLF4 motifs were associated 
with KLF4-containing distal DHSs, but not with DHSs containing 
NANOG, SOX2 or OCT4 motifs (Supplementary Fig. 15a, bottom). 

We also tested for significant co-associations between promoter 
types (defined by the presence of cognate motif classes; see 
Supplementary Methods) and motifs in paired distal DHSs (Fig. 5c 
and Supplementary Fig. 15b, c). For example, when a member of the 
ETS domain family (motifs ETS1, ETS2, ELF1, ELK1, NERF (also 
called ELF2), SPIB, and others) is present within a promoter DHS, 
motif PU.1 (also called SPI1) is significantly more likely to be 
observed in a correlated distal DHS (P< 10°). These results suggest 
that a limited set of general rules may govern the pairing of co- 
regulated distal DHSs with particular promoters. 
Stereotyped chromatin accessibility parallels function 


In addition to the synchronized activation of distal DHSs and pro- 
moters described above, we observed a surprising degree of patterned 
co-activation among distal DHSs, with nearly identical cross-cell-type 
patterns of chromatin accessibility at groups of DHSs widely separated 
in trans (Supplementary Figs 16 and 17). For many patterns, we 
observed tens or even hundreds of like elements around the genome. 
The simplest explanation is that such co-activated sites share 
recognition motifs for the same set of regulatory factors. We found, 
however, that the underlying sequence features for a given pattern were 
surprisingly plastic. This suggests that the same pattern of cell-selective 
chromatin accessibility shared between two DHSs can be achieved 
by distinct mechanisms, probably involving complex combinatorial 
tuning. 

We next asked whether distal DHSs with specific functions such 
as enhancers exhibited stereotypical patterning, and whether such 
patterning could highlight other elements with the same function. 
We examined one of the best-characterized human enhancers, 
DNase I HS2 of the β-globin locus control region. HS2 is detected 
in many cell types, but exhibits potent enhancer activity only in 
erythroid cells**. Using a pattern-matching algorithm (see Supplemen- 
tary Methods) we identified additional DHSs with nearly identical 
cross-cell-type accessibility patterns (Fig. 6a). We selected 20 elements 
across the spectrum of the top 200 matches to the HS2 pattern, and 
tested these in transient transfection assays in K562 cells (Supplemen- 
tary Methods). Seventy per cent (14 of 20) of these displayed enhancer 
activity (mean 8.4-fold over control) (Fig. 6a, f). Of note, one (E3) 
showed a greater magnitude of enhancement (18-fold versus control) 
than HS2, which is itself one of the most potent known enhancers’. 
Next we selected three elements from the 14 HS2-like enhancers, 
applied pattern matching (Methods) to each to identify stereotyped 
elements, and tested samples of each pattern for enhancer activity, 
revealing additional K562 enhancers (total 15 of 25 positive) 
(Fig. 6b-d, f). In each case, therefore, we were able to discover 
enhancers by simply anchoring on the cross-cell-type DHS pattern 
of an element with enhancer activity. Collectively, these results show 
that co-activation of DHSs reflected in cross-cell-type patterning of 
chromatin accessibility is predictive of functional activity within a 
specific cell type, and suggest more generally that DHSs with stereo- 
typed cellular patterning are likely to fulfil similar functions. 
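
The anchoring idea, ranking DHSs by how closely their cross-cell-type accessibility profile matches that of a known enhancer, can be illustrated with a short sketch. The authors' pattern-matching algorithm is described in their Supplementary Methods and is not reproduced here. The sketch substitutes plain Pearson correlation as the similarity measure, and the function name `top_pattern_matches` is hypothetical.

```python
import numpy as np

def top_pattern_matches(anchor, profiles, k=3):
    """Return (row_index, r) for the k profiles most correlated with anchor."""
    anchor = np.asarray(anchor, dtype=float)
    profiles = np.asarray(profiles, dtype=float)
    # Centre each profile so that the normalised dot product is Pearson r.
    a = anchor - anchor.mean()
    p = profiles - profiles.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(a) * np.linalg.norm(p, axis=1)
    # Flat profiles have zero variance; give them r = 0 instead of NaN.
    r = np.divide(p @ a, denom, out=np.zeros(len(p)), where=denom > 0)
    order = np.argsort(-r)[:k]  # best matches first
    return [(int(i), float(r[i])) for i in order]
```

Candidates drawn from across the ranked list, as in the 20 elements sampled from the top 200 matches, could then be taken forward for experimental testing.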

To visualize the qualities and prevalence of different stereotyped 
cross-cellular DHS patterns, we constructed a self-organizing map of 
a random 10% subsample of DHSs across all cell types and identified a 
total of 1,225 distinct stereotyped DHS patterns (Supplementary Figs 
18 and 19). Many of the stereotyped patterns discovered by the self- 
organizing map encompass large numbers of DHSs, with some count- 
ing >1,000 elements (Supplementary Fig. 20). 

Taken together, the above results show that chromatin accessibility 
at regulatory DNA is highly choreographed across large sets of co- 
activated elements distributed throughout the genome, and that 
DHSs with similar cross-cell-type activation profiles probably share 
similar functions. 


Variation in regulatory DNA linked to mutation rate 


The DHS compartment as a whole is under evolutionary constraint, 
which varies between different classes and locations of elements, and 
may be heterogeneous within individual elements. To understand the 
evolutionary forces shaping regulatory DNA sequences in humans, we 
estimated nucleotide diversity (π) in DHSs using publicly available 
whole-genome sequencing data from 53 unrelated individuals (see 
Supplementary Methods). We restricted our analysis to nucleotides 
outside of exons and RepeatMasked regions. To provide a comparison 
with putatively neutral sites, we computed π in fourfold degenerate 
synonymous positions (third positions) of coding exons. This analysis 
showed that, taken together, DHSs exhibit lower π than fourfold 
degenerate sites, compatible with the action of purifying selection. 
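
For readers unfamiliar with the statistic, nucleotide diversity is the average number of pairwise differences per site between sampled chromosomes. The sketch below uses the standard unbiased per-site estimator. Whether the authors used exactly this form is specified in their Supplementary Methods, and the function name and argument layout are assumptions.

```python
def nucleotide_diversity(alt_counts, n_chrom, n_sites):
    """Mean pairwise differences per site.

    alt_counts: alternate-allele counts at the variable sites
    n_chrom:    number of sampled chromosomes (for example 2 x individuals)
    n_sites:    total sites surveyed, variable or not
    """
    if n_chrom < 2 or n_sites <= 0:
        raise ValueError("need at least two chromosomes and one site")
    total = 0.0
    for k in alt_counts:
        # Probability that two distinct sampled chromosomes differ here.
        total += 2.0 * k * (n_chrom - k) / (n_chrom * (n_chrom - 1))
    return total / n_sites
```

With 53 diploid individuals, `n_chrom` would be 106. Applying the same function to DHS sites and to fourfold degenerate sites gives the two quantities being compared.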
Figure 6 | Stereotyped regulation of chromatin accessibility. a-e, Enhancers 
grouped by similar chromatin stereotypes. Related cell lines are colour 
matched. HS2 from the β-globin locus control region is at left. E1-E11 
represent progressively weaker matches to the HS2 stereotype. E12-13 derive 
from matches to a different stereotype based on another K562 enhancer. 

f, Experimental validation of enhancers detected by pattern matching. Bars 
indicate fold enrichment observed in transient assays in K562 relative to 
promoter-only control; mean of testing in both orientations is shown. Red bars 
indicate data from two potent in vivo enhancers, β-globin LCR HS2 and HS3; 
the latter requires chromatinization to function and is not active in transient 
assays. Gold bars indicate data from E1-E13 from a-e above. 


Figure 7a shows π for the DHSs of all analysed cell types, with colour 
coding to indicate the origin of each cell type. Particularly striking is the 
distribution of diversity relative to proliferative potential. DHSs in cells 
with limited proliferative potential have uniformly lower average 
diversity than immortal cells, with the difference most pronounced 
in malignant and pluripotent lines. This ordering is identical when 
highly mutable CpG nucleotides are removed from the analysis. 

If differences in π are due to mutation rate differences in different 
DHS compartments, the ratio of human polymorphism to human- 
chimpanzee divergence should remain constant across cell types. By 
contrast, differences in π due to selective constraint should result in 
pronounced differences. To distinguish between these alternatives, we 
first compared polymorphism and human-chimpanzee divergence 
for DHSs from normal, malignant and pluripotent cells (Fig. 7b). 
Differences in polymorphism and divergence between these three 
groups are nearly identical, compatible with a mutational cause. 
Second, raw mutation rate is expected to affect rare and common 
genetic variation equally, whereas selection is likely to have a larger 
impact on common variation. We consistently observe ~62% of 
single nucleotide polymorphisms (SNPs) in DHSs of each group to 
have derived-allele frequencies below 0.05. DHSs in different cell 
Figure 7 | Genetic variation in regulatory DNA linked to mutation rate. 

a, Mean nucleotide diversity (π, y axis) in DHSs of 97 diverse cell types (x axis) 
estimated using whole-genome sequencing data from 53 unrelated individuals. 
Cell types are ordered left-to-right by increasing mean π. Horizontal blue bar 
shows 95% confidence intervals on mean π in a background model of fourfold 
degenerate coding sites. Note the enrichment of immortal cells at right. ES, 
embryonic stem; iPS, induced pluripotent stem. b, Mean π (left y axis) for 


lines exhibit differences in SNP densities but not in allele frequency 
distribution (Fig. 7c). Collectively, these observations are consistent 
with increased relative mutation rates in the DHS compartment of 
immortal cells versus cell types with limited proliferative potential, 
exposing an unexpected link between chromatin accessibility, prolif- 
erative potential and patterns of human variation. 


Discussion 

Since their discovery over 30 years ago, DNase I hypersensitive sites 
have guided the discovery of diverse cis-regulatory elements in the 
human and other genomes. Here we have presented by far the most 
comprehensive map of human regulatory DNA, revealing novel 
relationships between chromatin accessibility, transcription, DNA 
methylation and the occupancy of sequence-specific factors. The wide 


pluripotent (yellow) versus malignancy-derived (red) versus normal cells (light 
green), plotted side-by-side with human-chimpanzee divergence (right y axis) 
computed on the same groups. Boxes indicate 25-75 percentiles, with medians 
highlighted. c, Both low- and high-frequency derived alleles show the same 
effect. Density of SNPs in DHSs with derived allele frequency (DAF) <5% (x 
axis) is tightly correlated (r² = 0.84) with the same measure computed for 
higher-frequency derived alleles (y axis). Colour-coding is the same as in panel a. 


analysis, some cell-type data sets that exceeded 40M tag depth were randomly 
subsampled to a depth of 30 million tags. Sequence reads were mapped using the 
Bowtie aligner, allowing a maximum of two mismatches. Only reads mapping 
uniquely to the genome were used in our analyses. Mappings were to male or 
female versions of hg19/GRCh37, depending on cell type, with random regions 
omitted. Data were analysed jointly using a single algorithm’ (Supplementary 
Methods) to localize DNase I hypersensitive sites. H3K4me3 ChIP-seq was per- 
formed using antibody 9751 (Cell Signaling) on 1% formaldehyde crosslinked 
samples sheared by Diagenode Bioruptor. Gene expression measurements for 
each cell type were performed on Affymetrix human exon microarrays. 5C 
experiments were performed as described*’*'. Transcription factor recognition 
motif occurrences within DHSs were defined with FIMO* at significance 
P<10 ° using motif models from the TRANSFAC database. 
An expansive human regulatory lexicon 
encoded in transcription factor footprints 

Sam John, Richard Sandstrom, Audra K. Johnson, Matthew T. Maurano, Richard Humbert, Eric Rynes, Hao Wang, 
Shinny Vong, Kristen Lee, Daniel Bates, Morgan Diegel, Vaughn Roach, Douglas Dunn, Jun Neri, Anthony Schafer, 
R. Scott Hansen, Tanya Kutyavin, Erika Giste, Molly Weaver, Theresa Canfield, Peter Sabo, Miaohua Zhang, 
Gayathri Balasundaram, Rachel Byron, Michael J. MacCoss, Joshua M. Akey, M. A. Bender, Mark Groudine, Rajinder Kaul 

Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNase I, leaving 
nucleotide-resolution ‘footprints’. Using genomic DNase I footprinting across 41 diverse cell and tissue types, we 
detected 45 million transcription factor occupancy events within regulatory regions, representing differential binding to 
8.4 million distinct short sequence elements. Here we show that this small genomic sequence compartment, roughly twice 
the size of the exome, encodes an expansive repertoire of conserved recognition sequences for DNA-binding proteins that 
nearly doubles the size of the human cis-regulatory lexicon. We find that genetic variants affecting allelic chromatin states 
are concentrated in footprints, and that these elements are preferentially sheltered from DNA methylation. High-resolution 
DNase I cleavage patterns mirror nucleotide-level evolutionary conservation and track the crystallographic topography of 
protein-DNA interfaces, indicating that transcription factor structure has been evolutionarily imprinted on the human 
genome sequence. We identify a stereotyped 50-base-pair footprint that precisely defines the site of transcript origination 
within thousands of human promoters. Finally, we describe a large collection of novel regulatory factor recognition motifs 
that are highly conserved in both sequence and function, and exhibit cell-selective occupancy patterns that closely parallel 
major regulators of development, differentiation and pluripotency. 


Sequence-specific transcription factors interpret the signals encoded 
within regulatory DNA. The discovery of DNase I footprinting over 30 
years ago revolutionized the analysis of cis-regulatory sequences in 
diverse organisms, and directly enabled the discovery of the first human 
sequence-specific transcription factors. Binding of transcription 
factors to regulatory DNA regions in place of canonical nucleosomes 
triggers chromatin remodelling, resulting in nuclease hypersensitivity. 
Within DNase I hypersensitive sites (DHSs), DNase I cleavage is not 
uniform; rather, punctuated binding by sequence-specific regulatory 
factors occludes bound DNA from cleavage, leaving footprints that 
demarcate transcription factor occupancy at nucleotide resolution 
(Fig. 1a). DNase I footprinting has been applied widely to study the 
dynamics of transcription factor occupancy and cooperativity within 
regulatory DNA regions of individual genes, and to identify cell- and 
lineage-selective transcriptional regulators. 


Regulatory DNA is populated with DNase I footprints 


To map DNase I footprints comprehensively within regulatory DNA, 
we adapted digital genomic footprinting to human cells. The ability 
to resolve DNase I footprints sensitively and precisely is critically 
dependent on the local density of mapped DNase I cleavages 
(Supplementary Fig. 1a-d), and efficient footprinting of a large 
genome such as human requires substantial concentration of 
DNase I cleavages within the small fraction (~1-3%) of the genome 
contained in DNase I-hypersensitive regions. We selected highly 
enriched DNase I cleavage libraries from 41 diverse cell types in which 
53-81% of DNase I cleavage sites localized to DNase I-hypersensitive 
regions (Supplementary Table 1), representing nearly tenfold higher 
signal-to-noise ratio than previous results from yeast, and two- to 
fivefold greater enrichment than achieved using end-capture of single 
DNase I cleavages. We then performed deep sequencing of these 
libraries, and obtained 14.9 billion Illumina sequence reads, 11.2 
billion of which mapped to unique locations in the human genome 
(Supplementary Table 1). We achieved an average sequencing depth of 
~273 million DNase I cleavages per cell type that enabled extensive 
and accurate discrimination of DNase I footprints. 

To detect DNase I footprints systematically, we implemented a 
detection algorithm based on the original description of quantitative 
DNase I footprinting (Supplementary Methods). We identified an 
average of ~1.1 million high-confidence (false discovery rate (FDR) 
of 1%) footprints per cell type (range 434,000 to 2.3 million; 
Supplementary Table 1), and collectively 45,096,726 6-40-base pair 
(bp) footprint events across all cell types. We resolved cell-selective 
footprint patterns to reveal 8.4 million distinct elements with a foot- 
print, each occupied in one or more cell type. At least one footprint was 
found in >75% of DHSs (Supplementary Fig. 1c, d and Supplementary 
Table 2), with detection strongly dependent on the number of mapped 
DNase I cleavages within each DHS. 99.8% of DHSs with >250 
mapped DNase I cleavages contained at least one footprint, indicating 
that DHSs are not simply open or nucleosome-free chromatin features, 
but are constitutively populated with DNase I footprints. Modelling 
Figure 1 | Parallel profiling of genomic regulatory factor occupancy across 
41 cell types. a, DNase I footprinting of K562 cells identifies the individual 
nucleotides within the MTPN promoter that are bound by NRF1. b, Example 
locus harbouring eight clearly defined DNase I footprints in T-helper type 1 
(Th1) and SK-N-SH_RA cells, with TRANSFAC database motif instances 
indicated below. c, Heat maps showing per-nucleotide DNase I cleavage (left) 
and vertebrate conservation by phyloP (right) for 4,262 NRF1 motifs within 


DNase I cleavage patterns using empirically derived intrinsic DNA 
cleavage propensities for DNase I showed that only a minuscule 
fraction (0.24%) of discovered 1% FDR footprints from cell and tissue 
samples could be caused by inherent DNaseI sequence specificity 
(Supplementary Methods). 
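
The per-cell-type averages quoted in this section follow directly from the reported totals. The short check below assumes that each uniquely mapped read is counted as one DNase I cleavage, which is an interpretation and not something stated explicitly in the text.

```python
N_CELL_TYPES = 41
unique_reads = 11.2e9          # uniquely mapped reads across all libraries
footprint_events = 45_096_726  # footprint events across all cell types

# Assumption: one uniquely mapped read corresponds to one cleavage.
mean_depth = unique_reads / N_CELL_TYPES
mean_footprints = footprint_events / N_CELL_TYPES

print(f"{mean_depth / 1e6:.0f} million cleavages per cell type")
print(f"{mean_footprints / 1e6:.1f} million footprints per cell type")
```

Both results agree with the rounded figures in the text (~273 million cleavages and ~1.1 million footprints per cell type).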

DNase I footprints were distributed throughout the genome, includ- 
ing intergenic regions (45.7%), introns (37.7%), upstream of transcrip- 
tional start sites (TSSs, 8.9%), and in 5’ and 3’ untranslated regions 
(UTRs, 1.4% and 1.3%, respectively; Supplementary Fig. 2a, b). DNase I 
footprints were enriched in promoters (3.6-fold; P < 2.2 × 10^-16; 
Binomial test) and 5’ UTRs (2.4-fold; P < 2.2 × 10^-16; Binomial test), 
commensurate with high DNase I cleavage densities observed in these 
regions. We found that 2.0% of footprints localized within exons, rais- 
ing the possibility that occupancy by DNA binding proteins could 
further restrict sequence diversity within coding DNA, thus super- 
imposing an unexpected layer of constraint on codon usage. 


Footprints are quantitative markers of factor occupancy 
We next examined the correspondence between DNase I footprints 
and known regulatory factor recognition sequences within DNase I 
hypersensitive chromatin. Comprehensive scans of DNase I hypersensitive 
regions for high-confidence matches to all recognized transcription 
factor motifs in the TRANSFAC and JASPAR databases 
revealed a striking enrichment of motifs within footprints (P ~ 0, 
z-score = 204.22 for TRANSFAC; z-score = 169.88 for JASPAR; 
Fig. 1b and Supplementary Fig. 3). 

To quantify the occupancy at transcription factor recognition 
sequences within DHSs genome-wide, we computed for each instance 
a footprint occupancy score (FOS) relating the density of DNase I 
cleavages within the core recognition motif to cleavages in the imme- 
diately flanking regions (Supplementary Methods). The FOS can be 
used to rank motif instances by the ‘depth’ of the footprint at that 
K562 DHSs ranked by the local density of DNase I cleavages. Green ticks 
indicate the presence of DNase I footprints over motif instances. Blue ticks 
indicate the presence of ChIP-seq peaks over the motif instances. d, Lowess 
regression of NRF1, USF1, NFE2 and NFYA K562 ChIP-seq signal intensities 
versus DNase I footprinting occupancy (footprint occupancy score) at K562 
DNase I footprints containing NRF1, USF, NFE2 and NFYA motifs. 


position, and is expected to provide a quantitative measure of factor 
occupancy. To examine this relationship for a well-studied sequence- 
specific regulator (NRF1; ref. 12), we plotted DNase I cleavage pat- 
terns surrounding all 4,262 NRF1 motifs contained within DHSs and 
ranked these by FOS. Whereas only a subset of these motif instances 
(2,351) coincided with high-confidence footprints, the vast majority 
of NRF1 motif instances in DNaseI footprints (89%) overlapped 
reproducible sites of NRF1 occupancy identified by chromatin immu- 
noprecipitation followed by high-throughput sequencing (ChIP-seq) 
(Fig. 1c). In parallel, we analysed nucleotide-level evolutionary con- 
servation patterns around NRF1-binding sites, revealing that FOS 
closely parallels phylogenetic conservation within the core motif 
region, indicating strong selection on factor occupancy (Fig. 1c). We 
observed a nearly monotonic relationship between FOS and ChIP-seq 
signal intensities at NRF1-binding sites within DNase I footprints of 
K562 cells (Fig. 1d). Similarly strong correlations between footprint 
occupancy and either ChIP-seq signal or phylogenetic conservation 
were evident for diverse factors (Fig. 1d and Supplementary Fig. 4a—d). 
We found that footprint occupancy and nucleotide-level conservation 
correlated for 80% of all transcription factor motifs in the TRANSFAC 
database, of which 50% were statistically significant (P< 0.05; 
Supplementary Methods). This relationship between footprint occu- 
pancy and conservation is most readily explained by evolutionary 
selection on factor occupancy, with higher conservation of higher 
affinity binding sites. Taken together, these results indicate that foot- 
print occupancy provides a quantitative measure of sequence-specific 
regulatory factor occupancy that closely parallels evolutionary con- 
straint and ChIP-seq signal intensity. 
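
A footprint occupancy score of the kind described above can be sketched as follows. The exact definition is given in the Supplementary Methods and is not reproduced in this section. The form used here, the mean core cleavage plus a pseudocount divided by the mean cleavage in each flank, is an assumption, as are the function name and the default flank width. Under this form, lower scores correspond to deeper footprints.

```python
def footprint_occupancy_score(cleavage, core_start, core_end, flank=10):
    """Assumed FOS: (C + 1)/L + (C + 1)/R; lower means a deeper footprint.

    cleavage: list of per-nucleotide DNase I cleavage counts
    core:     half-open interval [core_start, core_end) covering the motif
    """
    left = cleavage[max(0, core_start - flank):core_start]
    core = cleavage[core_start:core_end]
    right = cleavage[core_end:core_end + flank]
    if not left or not core or not right:
        raise ValueError("core and both flanks must be non-empty")

    def mean(xs):
        return sum(xs) / len(xs)

    c, l, r = mean(core), mean(left), mean(right)
    if l == 0 or r == 0:
        return float("inf")  # no flanking signal, occupancy not assessable
    return (c + 1) / l + (c + 1) / r
```

For example, a fully protected 6-bp core flanked by 20 cleavages per base scores 0.1, whereas a uniformly cleaved region of the same depth scores 2.1, so sorting motif instances by ascending score places the strongest footprints first.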

To validate the potential for selective binding of footprints by factors 
predicted on the basis of motif-to-footprint matching, we developed an 
approach to quantify specific occupancy in the context of a complex 
transcription factor milieu using targeted mass spectrometry (DNA 
interacting protein precipitation or DIPP; Methods). Using DIPP, we 
affirmed specific binding by several different classes of transcription 
factor (Supplementary Fig. 5a—e). Together with the analysis of ChIP- 
seq data described above, these results indicate that the localization of 
transcription factor recognition motifs within DNase I footprints can 
accurately illuminate the genomic protein occupancy landscape. 


Footprints harbour functional SNVs and lack methylation 


The potential for single nucleotide variants (SNVs) within a transcrip- 
tion factor recognition sequence to abrogate binding of its cognate 
factor is well known. The depth of sequencing performed in the 
context of our footprinting experiments provided hundreds- to thou- 
sands-fold coverage of most DHSs, enabling precise quantification of 
allelic imbalance within DHSs harbouring heterozygous variants. We 
scanned all DHSs for heterozygous SNVs identified by the 1000 
Genomes Project’* and measured, for each DHS containing a single 
heterozygous variant, the proportion of reads from each allele. We 
identified likely functional variants conferring significant allelic 
imbalance in chromatin accessibility and analysed their distribution 
relative to DNaseI footprints. This analysis revealed significant 
enrichment (P < 2.2 × 10^-16; Fisher’s exact test) of such variants 
within DNaseI footprints (Supplementary Fig. 6). For example, 
rs4144593 is a common T-to-C (T/C) variant that lies within a 
DHS on chromosome 9. This variant falls on a high-information 
position within a footprint containing an NF1/CTF1 motif and sub- 
stantially disrupts footprinting of this motif, resulting in allelic imbal- 
ance in chromatin accessibility (Fig. 2a). 
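
Allelic imbalance at a heterozygous site can be quantified by asking whether the reads split between the two alleles more unevenly than expected by chance. The statistical test used by the authors is not named in this section. The sketch below assumes a two-sided exact binomial test against an even split, and the function name is hypothetical. Probabilities are computed in log space so that the thousands-fold coverage mentioned above does not overflow.

```python
from math import exp, lgamma, log

def _log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def allelic_imbalance_pvalue(ref_reads, alt_reads, p_null=0.5):
    """Two-sided exact binomial P value for reads split between two alleles."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover the variant")
    pmf = [exp(_log_binom_pmf(k, n, p_null)) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum every outcome that is at least as unlikely as the observed one.
    tail = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, tail)
```

In practice a reference-mapping bias would shift the null proportion away from 0.5, which is why `p_null` is exposed as a parameter.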

Protein-DNA interactions are also sensitive to cytosine methylation. 
Comparing DNase I footprints and whole-genome bisulphite 
sequencing methylation data from pulmonary fibroblasts (IMR90), 
we found that CpG dinucleotides contained within DNase I footprints 
were significantly less methylated than CpGs in non-footprinted 
regions of the same DHS (Mann-Whitney U-test; P < 2.2 × 10^-16; 
Fig. 2b). Footprints therefore seem to be selectively sheltered from 
DNA methylation, indicating a widespread connection between 
regulatory factor occupancy and nucleotide-level patterning of 
epigenetic modifications. 


Transcription factor structure is imprinted on the genome 
We observed surprisingly heterogeneous base-to-base variation in 
DNase I cleavage rates within the footprinted recognition sequences 
of different regulatory factors. And yet, the per site cleavage profiles 
for individual factors were highly stereotyped, with nearly identical 
Figure 2 | DNase I footprints mark sites of in vivo protein occupancy. 

a, Schematic and plots showing the effect of T/C SNV rs4144593 on protein 
occupancy and chromatin accessibility. The y axis of the bar graph shows the 
number of DNase I cleavage events containing either the T or C allele. Middle 
plots show T or C allele-specific DNase I cleavage profiles from ten cell lines 
heterozygous for the T/C alleles at rs4144593. Right plots show DNase I 
local cleavage patterns at thousands of genomic locations (Supplementary 
Fig. 7). This raised the possibility that DNase I cleavage patterns 
may provide information concerning the morphology of the DNA- 
protein interface. We obtained the available DNA-protein co-crystal 
structures for human transcription factors, and mapped aggregate 
DNase I cleavage patterns at individual nucleotide positions onto the 
DNA backbone of the co-crystal model. Figure 3a and Supplementary 
Fig. 8a show two examples: USF1 (ref. 17) and SRF. For both factors, 
DNase I cleavage patterns clearly parallel the topology of the protein- 
DNA interface, including a marked depression in DNase I cleavage at 
nucleotides involved in protein-DNA contact, and increased cleavage 
at exposed nucleotides such as those within the central pocket of the 
leucine zipper. These data show that nucleotide-level aggregate 
DNaseI cleavage patterns reflect fundamental features of the pro- 
tein-DNA interaction interface at unprecedented resolution. 

We next asked how these patterns related to evolutionary conservation. 
Plotting nucleotide-level aggregate DNase I cleavage in parallel 
with per-nucleotide vertebrate conservation calculated by 
phyloP revealed striking antiparallel patterning of cleavage versus 
conservation across nearly all motifs examined (six representative 
examples are shown in Fig. 3b and Supplementary Fig. 8b). 
Notably, conservation is not limited to only DNA contacting protein 
residues, but exhibits graded changes that mirror DNase I accessibility 
across the entirety of the protein-DNA interface (Supplementary Figs 
8c, d). Taken together, these results imply that regulatory DNA 
sequences have evolved to fit the continuous morphology of the tran- 
scription factor-DNA binding interface. 


A ~50-bp footprint localizes transcription initiation 
Transcription initiation requires the binding of multi-protein 
complexes that position RNA polymerase II. Using a modified 
footprint detection algorithm designed to detect larger features 
(Supplementary Methods), we scanned the regions upstream from 
GENCODE TSSs and identified a highly stereotyped ~80-bp 
chromatin structure comprising a prominent ~50-bp central DNase I 
footprint, flanked symmetrically by ~15-bp regions of uniformly ele- 
vated levels of DNase I cleavage (Fig. 4a). Alignment of per-nucleotide 
DNase I cleavage profiles from 5,041 prominent footprints mapped in 
different K562 promoters highlights the homogeneous, nearly invari- 
ant nature of the structure (Fig. 4b). 

Plotting evolutionary conservation in parallel with DNase I cleavage 
revealed two distinct peaks in evolutionary conservation within the 
central footprint (Fig. 4c) compatible with binding sites for paired 


[Figure 2 panels: a, T or C allele at SNV rs4144593 (chr9: 36399995) within an NF1/CTF1 motif; b, average CpG methylation (%) in DNase I footprints and within DHSs.] 


cleavage profiles from 18 cell lines homozygous for the C allele at rs4144593 and 
one cell line homozygous for the T allele at rs4144593. Cleavage plots are cut off 
at 60% cleavage height. b, The average CpG methylation within IMR90 DNase I 
footprints, IMR90 DHSs (but not in footprints) and non-hypersensitive 
genomic regions in IMR90 cells. CpG methylation is significantly depleted in 
DNase I footprints (P < 2.2 × 10⁻¹⁶, Mann-Whitney U-test). 
Figure 3 | Footprint structure parallels transcription factor structure and is 
imprinted on the human genome. a, The co-crystal structure of upstream 
stimulatory factor (USF1) bound to its DNA ligand is juxtaposed above the 
average nucleotide-level DNase I cleavage pattern (blue) at motif instances of 
USF in DNase I footprints. Nucleotides that are sensitive to cleavage by DNase I 
are coloured blue on the co-crystal structure. The motif logo generated from 
USF DNase I footprints is displayed below the DNase I cleavage pattern. Below 
is a randomly ordered heat map showing the per-nucleotide DNase I cleavage 
for each motif instance of USF in DNase I footprints. b, The per-base DNase I 
hypersensitivity (blue) and vertebrate phylogenetic conservation (red) for all 
DNase I footprints in dermal fibroblasts matching three well-annotated 
transcription factor motifs. The white box indicates width of consensus motif. 
The number of motif occurrences within DNase I footprints is indicated below 
each graph. 


canonical sequence-specific transcription factors. The density of 
capped analysis of gene expression (CAGE) tags (Fig. 4d; green line) 
and 5’ ends of expressed sequence tags (ESTs) (Fig. 4d; orange line) 
relative to the central ~50-bp footprint revealed that, at the vast majority 
of promoters, RNA transcript initiation localized precisely within the 
stereotyped footprint. It is notable that the location of this footprint is 
often offset, typically 5’, from many GENCODE-annotated TSSs. This 
probably derives from the incomplete nature of many of the 5’ transcript 
ends used to define TSSs*. 

These data together define a new high-resolution chromatin struc- 
tural signature of transcription initiation and the interaction of the 
pre-initiation complex with the core promoter. Indeed, chromatin 
occupancy of TATA-binding protein (TBP), a critical component of 
the pre-initiation complex, is maximal precisely over the centre of the 
50-bp footprint region (Supplementary Fig. 9a). Sequence analysis of 
the two conservation peaks within the 50-bp footprint identified 
motifs for GC-box-binding proteins such as SP1 and, less frequently, 
other general transcription factors (though with the notable absence 
of TATA motifs) (Supplementary Fig. 9b), indicating that TBP (and 
potentially other pre-initiation complex components) interacts pref- 
erentially with general transcriptional factors bound to GC-box-like 
Figure 4 | A highly stereotyped chromatin structural motif marks sites of 
transcription initiation in human promoters. a, A 35-55-bp footprint is the 
predominant feature of many promoter DHSs and is in tight spatial 
coordination with the transcription start site. b, Heat map of the per-nucleotide 
DNase I cleavage pattern at 5,041 instances of this stereotypical footprint in 
K562 cells. c, Aggregate per-base DNase I cleavage profile (blue line) and mean 
[…] stereotypical footprint in K562 cells (red dashed line). d, Aggregate strand 
corrected CAGE sequencing data (green line) and the average nearest 5’ end of 
a spliced EST (orange line) surrounding instances of this stereotypical footprint 
in K562 cells. 


features in the central footprinted region. The results are therefore 
consistent with a model in which a limited number of sequence- 
specific factors function both to prime the chromatin template 
for recruitment of RNA polymerase II and to guide transcriptional 
positioning. 


Distinguishing indirect transcription factor occupancy 
Many transcriptional regulators are posited to interact indirectly with 
the DNA sequence of some target sites through mechanisms such as 
tethering®. Approaches such as ChIP-seq detect chromatin occu- 
pancy, but cannot by themselves distinguish sites of direct DNA 
binding from non-canonical indirect binding. We therefore asked 
whether DNase I footprint data could illuminate ChIP-seq-derived 
occupancy profiles by differentiating directly bound factors from 
indirect binding events. We first partitioned ChIP-seq peaks from 
each of 38 ENCODE transcription factors*® mapped in K562 cells 
into three categories of predicted sites: ChIP-seq peaks containing a 
compatible footprinted motif (directly bound sites); ChIP-seq peaks 
lacking a compatible motif or footprint (indirectly bound sites); and 
ChIP-seq peaks overlying a compatible motif lacking a footprint 
(indeterminate sites). Predicted indirect sites showed significantly 
reduced ChIP-seq signal compared with predicted directly bound 
sites (Supplementary Fig. 10), consistent with lack of direct crosslink- 
ing to DNA (and therefore reduced ChIP efficiency). Indeterminate 
sites exhibited low ChIP-seq signal and were therefore excluded from 
further analysis (Supplementary Fig. 10). 
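
The three-way partition described above can be read as a simple decision rule. The sketch below is illustrative only: the function names and the two boolean inputs are hypothetical, and it assumes that any peak without a compatible motif counts as indirectly bound. How a "compatible motif" or an overlapping footprint is actually called is not specified in this section.

```python
from collections import Counter


def classify_peak(has_compatible_motif: bool, motif_in_footprint: bool) -> str:
    """Assign a ChIP-seq peak to one of the three predicted categories.

    Assumption: a peak with no compatible motif is treated as indirect,
    whether or not some other footprint lies under the peak.
    """
    if has_compatible_motif and motif_in_footprint:
        return "direct"          # compatible footprinted motif
    if has_compatible_motif:
        return "indeterminate"   # compatible motif, but no footprint over it
    return "indirect"            # no compatible motif


def partition_peaks(peaks):
    """Count categories for an iterable of (has_motif, in_footprint) pairs."""
    return Counter(classify_peak(m, f) for m, f in peaks)


# Toy input: four hypothetical peaks for one factor.
toy_counts = partition_peaks(
    [(True, True), (True, False), (False, False), (False, True)]
)
```

Comparing the "direct" and "indirect" counts per factor would give the kind of direct versus indirect fraction discussed in the text.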

The fraction of ChIP-seq peaks predicted to represent direct versus 
indirect binding varied widely between different factors, ranging from 
nearly complete direct sequence-specific binding (for example, 
CTCF), to nearly complete indirect binding (for example, TBP; 
Supplementary Fig. 11). In many cases factors that preferentially 
engage in direct DNA binding at distal sites show predominantly 
indirect occupancy in promoter regions and vice versa (Supplementary 
Fig. 12a, b). 

Next, we analysed the frequency with which indirectly bound sites 
of one transcription factor coincided with directly bound sites of a 
second factor, indicative of protein-protein interactions (for example, 
tethering). This analysis recovered many known protein-protein 
interactions, such as CTCF-YY1 and TAL1-GATA1 (ref. 27), as well 
as many novel associations (Fig. 5). We observed enrichment for 
NFE2 indirect interactions at promoter-bound USF2 sites, compatible 
with their known interaction”. At distal sites, we observed the opposite, 
with NFE2 predominantly directly bound accompanied by USF2 
indirect peaks (Supplementary Fig. 12a, b), indicating the possibility 
of a reciprocal or looping mechanism. Notably, directly bound 
promoter-predominant transcription factors were enriched for 
co-localization with indirect peaks compared to distal regions (Sup- 
plementary Fig. 13a, b). These results suggest that combining DNase I 
footprinting with ChIP-seq has the potential to expose a previously 
unappreciated landscape of complex transcription factor occupancy 
modes. 
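
As a rough illustration of the coincidence analysis above, the sketch below computes the fraction of one factor's indirect peaks that overlap another factor's direct peaks. It is not the enrichment statistic shown in Fig. 5, which is not defined in this section. The function name, the interval representation (chromosome, start, end; half-open) and the toy coordinates are all assumptions.

```python
def coincidence_fraction(indirect_a, direct_b):
    """Fraction of factor A's indirect peaks overlapping a direct peak of factor B.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    A quadratic scan is used for clarity, not speed.
    """
    if not indirect_a:
        return 0.0
    hits = sum(
        1
        for chrom, start, end in indirect_a
        if any(
            chrom == c2 and start < e2 and s2 < end
            for c2, s2, e2 in direct_b
        )
    )
    return hits / len(indirect_a)


# Hypothetical peaks for two factors.
indirect_a = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 100, 300)]
direct_b = [("chr1", 250, 450), ("chr2", 400, 600)]
fraction = coincidence_fraction(indirect_a, direct_b)
```

A real analysis would compare this fraction against a background expectation before calling an association enriched.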


Footprints encode an expansive cis-regulatory lexicon 
Since the discovery of the first sequence-specific transcription 
factor*’, considerable effort has been devoted to identifying the 
cognate recognition sequences of DNA-binding proteins*®*’. 
Despite these efforts, high-quality motifs are available for only a 
minority of the >1,400 human transcription factors with predicted 
sequence-specific DNA binding domains”. 

We reasoned that the genomic sequence compartment defined by 
DNase I footprints in a given cell type ideally should contain much, if 
not all, of the factor recognition sequence information relevant for 
that cell type. Consequently, applying de novo motif discovery to the 
Figure 5 | Distinguishing direct and indirect binding of transcription 
factors. Heat map of the enrichment of pairs of transcription factors in a 
direct-indirect association. Direct peaks are defined by ChIP occupancy 
accompanied by a footprint overlapping a compatible motif. Indirect peaks do […] 

footprint compartments gleaned from multiple cell types should 
greatly expand our current knowledge of biologically active transcrip- 
tion factor binding motifs. 

We performed unbiased de novo motif discovery within the foot- 
prints identified in each of the 41 cell types that yielded 683 unique 
motif models (Fig. 6a and Supplementary Methods). We compared 
these models with the universe of experimentally grounded motif 
models in the TRANSFAC, JASPAR and UniPROBE* databases. 
Owing to the redundancy of motif models contained within these 
databases, we first collapsed all duplicate models (Supplementary 
Methods). A total of 394 of the 683 (58%) de novo motifs matched 
distinct experimentally grounded motif models, accounting collec- 
tively for 90% of all unique entries across the three databases 
(Fig. 6b and Supplementary Fig. 14a-c). The wholesale de novo 
derivation of the vast majority of known regulatory factor recognition 
sequences from the small genomic compartment defined by DNase I 
footprints highlights the marked concentration of regulatory 
information encoded within this sequence space. 

Notably, 289 of the footprint-derived motifs were absent from 
major databases (Fig. 6b and Supplementary Fig. 14d). These novel 
motifs populate millions of DNase I footprints (Fig. 6c), and show 
features of in vivo occupancy and evolutionary constraint similar to 
motifs for known regulators, including marked anti-correlation 
with nucleotide-level vertebrate conservation (Figs 3b, 6e and 
Supplementary Figs 8 and 15a). 

To test whether novel motifs were functionally conserved in an 
evolutionarily distant mammal, we analysed DNase I cleavage 
patterns around human novel motifs mapped within DHSs assayed 
in primary mouse liver tissue (Fig. 6e, f and Supplementary Fig. 15a, b). 
This analysis demonstrated that many novel motifs show nearly 
identical DNase I footprint patterns in both human cells and mouse 
liver, indicating that these novel motifs correspond to evolutionarily 
conserved transcriptional regulators that are functional in both mouse 
and human. 

Given the conservation of protein occupancy in a distant mammal, 
we assessed whether the novel motifs are under selection in human 
populations by analysing nucleotide diversity across all motif 
instances found within accessible chromatin. Using high-quality 
genomic sequence data from 53 unrelated individuals** (Supplemen- 
tary Table 4), we calculated the average nucleotide diversity” for each 
individual motif space (Supplementary Fig. 15c). Reduced diversity 
levels are indicative of functional constraint, through the elimination 
of deleterious alleles from the population by natural selection. We 
found that novel motifs are collectively under strong purifying selec- 
tion in human populations. On average, the new motifs are more 
constrained than most motifs found in the major databases (Fig. 6d 
and Supplementary Fig. 15c), even after exclusion of motifs contain- 
ing highly mutable CpG dinucleotides, which underlie the marked 
increase in nucleotide diversity seen with a subset of known motifs 
(Supplementary Fig. 15c, right). Collectively, these results demonstrate 
that DNase I footprints encode an expansive cis-regulatory 
lexicon encompassing both known transcription factor recognition 
sequences and novel motifs that are functionally conserved in mouse 
and bear strong signatures of ongoing selection in humans. 
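
For readers unfamiliar with nucleotide diversity, the sketch below shows the standard pairwise-difference definition of π: the average number of differences per site between all pairs of sequences. It is a minimal illustration on toy haplotypes and is not claimed to be the exact estimator used for the analysis above. The function name and input format are assumptions.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Mean pairwise differences per site among aligned, equal-length sequences."""
    if len(haplotypes) < 2:
        raise ValueError("need at least two sequences")
    length = len(haplotypes[0])
    if length == 0 or any(len(h) != length for h in haplotypes):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(haplotypes, 2))
    # Count mismatching positions for every pair of sequences.
    total_diff = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total_diff / (len(pairs) * length)


# Toy example: four haplotypes over a 4-bp motif instance, one variable site.
pi_toy = nucleotide_diversity(["ACGT", "ACGT", "ACGA", "ACGA"])
```

Lower values of π across the instances of a motif would be read, as in the text, as a sign of stronger constraint.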


Novel motif occupancy parallels regulators of cell fate 

Cell-selective gene regulation is mediated by the differential occu- 
pancy of transcriptional regulatory factors at their cognate cis-acting 
elements. For example, the nerve growth factor gene VGF is selectively 
expressed only within neuronal cells (Fig. 7a), presumably due to the 
repressive action of the transcriptional regulator NRSF (also called 
REST) at the VGF promoter in non-neuronal cell types*’. Although 
VGF is expressed only in neuronal cells, its promoter is DNase I- 
hypersensitive in most cell types (not shown). Examination of 
nucleotide-level cleavage patterns within the VGF promoter exposes 
its fundamental cis-regulatory logic, coordinated by the transcriptional 
Figure 6 | De novo motif discovery expands the human regulatory lexicon. 
a, Overview of de novo motif discovery using DNase I footprints. b, Annotation 
of the 683 de novo-derived motif models using previously identified 
transcription factor motifs. A total of 394 of these de novo-derived motifs match 
a motif annotated within the TRANSFAC, JASPAR or UniPROBE databases, 
whereas 289 are novel motifs (pie chart). The de novo consensus matching 
TRANSFAC, JASPAR or UniPROBE sequences cover the majority of each 
database (bar chart). c, Example of a DNase I footprint found in multiple cell 
types that is annotated solely by one of the novel de novo-derived motifs. d, Box- 
and-whisker plot comparing the average nucleotide diversity at instances of the 
289 novel de novo-derived motif models to instances of motifs present in 


regulators NRSF, SP1, USF1 and NRF1. Whereas the NRSF motif is 
tightly occupied in non-neuronal cells, in neuronal cells, NRSF repres- 
sion is relieved, and recognition sites for the positive regulators USF1 
and SP1 become highly occupied, resulting in VGF expression. These 
data collectively illustrate the power of genomic footprinting to resolve 
differential occupancy of multiple regulatory factors in parallel at 
nucleotide resolution. 

We next extended this paradigm using genome-wide DNase I foot- 
prints across 12 functionally distinct cell types to identify both known 
and novel factors showing highly cell-specific occupancy patterns. To 
calculate the footprint occupancy of a motif, we enumerated for each 
motif and cell type the number of motif instances encompassed within 
DNase I footprints and normalized this by the total number of 
DNase I footprints in that cell type. Figure 7b shows a heat-map 
representation of cell-selective occupancy at motifs for 60 known 
transcriptional regulators and for 29 novel motifs. This approach 
appropriately identified a number of known cell-selective transcrip- 
tional regulators including: (1) the pluripotency factors OCT4 (also 
called POU5F1), SOX2, KLF4 and NANOG in human embryonic 
stem cells”; (2) the myogenic factors MEF2A and MYF6 in skeletal 
myocytes**; and (3) the erythrogenic regulators GATA1, STAT1 and 
STAT5A in erythroid cells*” (Fig. 7b). 
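
The footprint occupancy calculation described above (motif instances within footprints, divided by the total number of footprints in that cell type) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name, the (chromosome, start, end) half-open interval format and the interpretation of "encompassed" as full containment are assumptions.

```python
def footprint_occupancy(motif_instances, footprints):
    """Motif instances fully inside a footprint, divided by the number of footprints.

    Both inputs are lists of (chrom, start, end) tuples, half-open coordinates.
    A quadratic scan keeps the example short; real data would need an index.
    """
    if not footprints:
        raise ValueError("at least one footprint is required")
    n_inside = sum(
        1
        for m_chrom, m_start, m_end in motif_instances
        if any(
            f_chrom == m_chrom and f_start <= m_start and m_end <= f_end
            for f_chrom, f_start, f_end in footprints
        )
    )
    return n_inside / len(footprints)


# Hypothetical footprints and motif instances for one cell type.
footprints = [("chr1", 100, 120), ("chr1", 200, 230),
              ("chr2", 50, 70), ("chr2", 300, 320)]
motifs = [("chr1", 105, 115), ("chr1", 118, 128),
          ("chr2", 52, 62), ("chr3", 10, 20)]
occupancy = footprint_occupancy(motifs, footprints)
```

Computing this value for each motif in each cell type gives one cell of a heat map like the one described for Figure 7b.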
databases of known specificities (x axis). The box defines the 25th and 75th 
percentiles and the whiskers display 1.5 times the interquartile range of the 
distribution of π values in each respective database. The blue bar indicates the 
average nucleotide diversity (π) at fourfold degenerate coding sites (width is 
equal to 95% confidence interval); gold bar indicates π at all coding sites (width 
is equal to 95% confidence interval). e, Phylogenetic conservation (red dashed) 
and per-base DNase I hypersensitivity (blue) for all DNase I footprints in 
dermal fibroblast cells matching two novel de novo-derived motifs. The white 
box indicates width of consensus motif. f, Per-nucleotide mouse liver DNase I 
cleavage patterns at occurrences of the motifs in e at DNase I footprints 
identified in mouse liver. 


Many of the footprint-derived novel motifs displayed markedly cell- 
selective occupancy patterns highly similar to the aforementioned 
well-established regulators. This suggests that many novel motifs 
correspond to recognition sequences for important but uncharacterized 
regulators of fundamental biological processes. Notably, both known 
and novel motifs with high cell-selective occupancy predominantly 
localized to distal regulatory regions (Fig. 7c), further highlighting 
the role of distal regulation in developmental and cell-selective 
processes’, 


Perspective 


We describe an expansive map of regulatory factor occupancy at 
millions of precisely demarcated sequence elements across the human 
genome revealed by genomic DNase I footprinting applied to a wide 
spectrum of cell types. These elements collectively define a highly 
information-rich genomic sequence compartment that encodes the 
recognition landscape of hundreds of DNA-binding proteins. This 
compartment has been extensively shaped by evolutionary forces to 
match closely the physical properties of its cognate interacting 
proteins. Mining footprint sequences for recognition motifs has 
nearly doubled the human cis-regulatory lexicon, exposing a previ- 
ously hidden trove of elements with evolutionary, structural and 
Figure 7 | Multi-lineage DNase I footprinting reveals cell-selective gene 
regulators. a, Comparative footprinting of the nerve growth factor gene (VGF) 
promoter in multiple cell types reveals both conserved (NRF1, USF1 and SP1) 
and cell-selective (NRSF) DNase I footprints. b, Shown is a heat map of 
footprint occupancy computed across 12 cell types (columns) for 89 motifs 
(rows), including well-characterized cell/tissue-selective regulators, and novel 


functional profiles that parallel the collections of experimentally 
derived genomic regulators brought to light during the past 30 years. 
Because the ability to resolve footprints is dependent on sequencing 
depth, and the sequencing level of DNase I cleavage events in most 
DHSs is not saturating (even in cell types with >500 million mapped 
unique DNase I cleavages), the present study, although extensive in 
many respects, represents only an initial foray into this biologically 
rich space. Identification of the cognate DNA-binding proteins for 
novel recognition sequences presents a significant challenge, although 
one that can be addressed with confidence using emerging technolo- 
gies and our extensive experimental data demonstrating both occu- 
pancy in vivo and strong evolutionary signatures of function. On a 
broader level, the approach that we describe here can, in principle, be 
applied to derive the cis-regulatory lexicon of any organism. We 
anticipate that the extensive new resources we describe, particularly 
in combination with other ENCODE data, will help to advance many 
aspects of human gene regulation research. Co-published ENCODE- 
related papers can be explored online via the Nature ENCODE 
explorer (http://www.nature.com/ENCODE), a specially designed 
visualization tool that allows users to access the linked papers and 
investigate topics that are discussed in multiple papers via thematic- 
ally organized threads. 
de novo-derived motifs (red text). The motif models for some of these novel de 
novo-derived motifs are indicated next to the heat map. c, The proportion of 
motif instances in DNase I footprints within distal regulatory regions for 
known (black) and novel (red) cell-type-specific regulators in b is indicated. 
Also noted are these values for a small set of known promoter-proximal 
regulators (green). ES, embryonic stem. 


METHODS SUMMARY 


DNase I digestion and high-throughput sequencing were performed on intact 
human nuclei from various cell types, following published methods*™. Briefly, 
roughly 10 million cells were grown in appropriate culture media and nuclei were 
extracted using NP-40 in an isotonic buffer. The NP-40 detergent was removed 
and the nuclei were incubated for 3 min at 37 °C with limiting concentrations of 
the DNA endonuclease, DNase I (Sigma) supplemented with Ca²⁺ and Mg²⁺. 
The digestion was stopped with EDTA and the samples were treated with 
proteinase K. The small ‘double-hit’ fragments (<500 bp) were recovered by 
sucrose ultra-centrifugation, end-repaired and ligated with adapters compatible 
with the Illumina sequencing platform. High-quality libraries from each cell type 
were sequenced on the Illumina platform to an average depth of 273 million 
uniquely mapping single-end tags. The sequencing tags were aligned to the 
human reference genome and per-nucleotide cleavage counts were generated 
by summing the 5’ ends of the aligned sequencing tags at each position in the 
genome. FDR 1% DNase I footprints were identified using an iterative search 
method based on optimization of the footprint occupancy score. De novo motif 
discovery was performed using a full enumeration algorithm. 
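
The per-nucleotide cleavage counting step described above can be sketched in a few lines. The example assumes BED-like 0-based, half-open tag coordinates, so the 5’ end of a reverse-strand tag is taken as `end - 1`. The function name and tag tuple layout are hypothetical, and any strand-specific offset applied in the published pipeline is not reproduced here.

```python
from collections import Counter


def cleavage_counts(tags):
    """Sum 5' ends of aligned tags at each genomic position.

    tags: iterable of (chrom, start, end, strand), 0-based half-open.
    Returns a Counter keyed by (chrom, position).
    """
    counts = Counter()
    for chrom, start, end, strand in tags:
        # The 5' end is the leftmost base on '+' and the rightmost base on '-'.
        five_prime = start if strand == "+" else end - 1
        counts[(chrom, five_prime)] += 1
    return counts


# Toy alignment: three tags share a 5' end at chr1:100.
toy_tags = [
    ("chr1", 100, 136, "+"),
    ("chr1", 100, 136, "+"),
    ("chr1", 65, 101, "-"),
    ("chr1", 120, 156, "-"),
]
toy_profile = cleavage_counts(toy_tags)
```

The resulting per-position counts are the kind of profile on which a footprint search would then operate.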
Architecture of the human regulatory 
network derived from ENCODE data 


Mark B. Gerstein, Anshul Kundaje, Manoj Hariharan, Stephen G. Landt, Koon-Kiu Yan, Chao Cheng, 
Xinmeng Jasmine Mu, Ekta Khurana, Joel Rozowsky, Roger Alexander, Renqiang Min, Pedro Alves, 
Alexej Abyzov, Nick Addleman, Nitin Bhardwaj, Alan P. Boyle, Philip Cayting, Alexandra Charos, David Z. Chen, 
Yong Cheng, Declan Clarke, Catharine Eastman, Ghia Euskirchen, Seth Frietze, Yao Fu, Jason Gertz, Fabian Grubert, 
Arif Harmanci, Preti Jain, Maya Kasowski, Phil Lacroute, Jing Leng, Jin Lian, Hannah Monahan, Henriette O’Geen, 
Zhengqing Ouyang, E. Christopher Partridge, Dorrelyn Patacsil, Florencia Pauli, Debasish Raha, Lucia Ramirez, 
Timothy E. Reddy, Brian Reed, Minyi Shi, Teri Slifer, Jing Wang, Linfeng Wu, Xinqiong Yang, Kevin Y. Yip, 
Gili Zilberman-Schapira, Serafim Batzoglou, Arend Sidow, Peggy J. Farnham, Richard M. Myers, Sherman M. Weissman 
& Michael Snyder 


Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of 
these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the 
principles of the human transcriptional regulatory network, we determined the genomic binding information of 
119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of 
transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. 
In particular, there are significant differences in the binding proximal and distal to genes. We organized all the 
transcription factor binding into a hierarchy and integrated it with other genomic information (for example, 
microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for 
instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate 
targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched 
network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components 
are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the 
two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome 
sequences and understanding basic principles of human biology and disease. 


A central goal in biology is to understand how a limited cohort of 
transcription factors is able to organize the large diversity of 
gene-expression patterns in different cell types and conditions. 
Over the past decade, system-wide analyses of 
transcription-factor-binding patterns have been performed in unicellular 
model organisms, such as Escherichia coli and yeast, and have 
revealed a great deal of information about the organization of regulatory 
information’ *. These studies have provided insights into such 
features as network hubs’, connectivity correlations’, hierarchical 
organization’®”* and network motifs'*’*. Moreover, more complex 
networks that integrate disparate forms of genomic and proteomic 
data, such as protein-protein interactions and phosphorylation, have 
related gene regulation to other biological processes'* '°. However, for 
humans, systems-level analyses have been a challenge due to the size 
of the transcription factor repertoire and genome, and only specific 
regulatory subnetworks with a handful of factors have been reported 
thus far'’'°. The large-scale data from the ENCODE project now 
begins to enable such analyses*’. Moreover, with the vast amount of 
human polymorphism data and genome sequences of many mammals”'”’, 
it is possible to obtain an unprecedented view of how selection 
relates to networks. 

Here we present an analysis of the genome-wide binding profiles of 
119 transcription-related factors, including sequence-specific, general 
and chromatin-acting factors. (For simplicity, we refer to all of these 
as transcription factors, and we use TFSS to denote canonical 
sequence-specific factors.) We first used the transcription-factor- 
binding data to analyse the co-association patterns between different 
factors, as well as their differential patterns in promoter-proximal and 
distal regulatory regions. We then organized the binding patterns into 
a stratified hierarchy representing the overall systems-level regulatory 
wiring. To this, we added other forms of network information, includ- 
ing non-coding RNA (ncRNA) regulation (especially microRNAs 
(miRNAs))**”, protein-protein interactions*”®, and protein 
phosphorylation’. We analysed this ‘meta-network’ for properties that 
differ based on hierarchical level and connectivity (for example, hubs 
versus non-hubs) and also searched for enriched network motifs. 
Finally, we surveyed the pattern of sequence variation over the net- 
work, examining selective pressure and allelic effects (preferential 
binding to the maternal or paternal allele). Several of our key findings 
are summarized below. 

• Human transcription factors co-associate in a combinatorial and 
context-specific fashion; different combinations of factors bind near 
different targets, and the binding of one factor often affects the 
preferred binding partners of others. Moreover, transcription factors 
often show different co-association patterns in gene-proximal and 
distal regions. 

• Different parts of the hierarchical transcription factor network 
exhibit distinct properties. For instance, the middle level has the most 
information-flow bottlenecks and, offsetting this, tends to have the most 
regulatory collaboration between transcription factors. Conversely, 
higher-level transcription factors have the greatest connectivity with 
other networks (for example, the phosphorylome). 

• The occurrence of the feed-forward loops is strongly enriched in the 
transcription factor network, as are a number of motifs in which two 
genes co-regulated by a factor are bridged by a protein-protein inter- 
action or regulating miRNA. 

• Highly connected network elements (both transcription factors and 
targets) are under strong evolutionary selection and exhibit 
stronger allele-specific activity (this is particularly apparent when 
multiple factors are involved). Surprisingly, however, elements with 
allelic activity are under weaker selection than non-allelic ones. 




Overview of data and processing 


The ENCODE project has generated chromatin immunoprecipitation 
and high-throughput sequencing (ChIP-seq) data sets for 119 distinct 
transcription factors over five main cell lines (Supplementary 
Information, section B.1, and Supplementary Tables 1 and 2a). 
Each data set contains at least two biological replicates. In addition, 
for a select set of factors (Supplementary Fig. 1c), short interfering 
RNA (siRNA) experiments were performed, where the transcription 
factor was depleted and expression changes were quantified by RNA- 
seq (Supplementary Information, section B.2). Most of the factors (88, 
74%) are TFSSs that can be subcategorized on the basis of their DNA- 
binding domain sequences (Supplementary Table 2a)’*. A small sub- 
set (16, 13%) comprises POL2 and general transcriptional machinery; 
a final subset (15, 13%) consists of chromatin-modifying and 
remodelling factors. 

To allow effective integrative analysis of these diverse data sets, 
we developed a uniform processing pipeline and quality-control 
measures (Supplementary Information, section B.1, and Supplemen- 
tary Figs 1a, b and 2a; data at http://www.encodeproject.org). In total, 
we identified 7,424,765 peaks; 2,948,387 (~40%) were proximal 
(within ±2.5 kilobases) to annotated gene transcription start sites 
(TSSs). 
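
The proximal versus distal split above reduces to a distance test against the nearest annotated TSS. The sketch below shows one way to do this on a single chromosome. The function name, the use of a single representative peak position (for example a summit) and the toy coordinates are assumptions, not details taken from the pipeline.

```python
from bisect import bisect_left


def is_proximal(peak_pos, sorted_tss, window=2500):
    """Return True if peak_pos lies within `window` bp of any TSS.

    sorted_tss must be an ascending list of TSS positions on one chromosome.
    Only the two TSSs flanking the peak need to be checked.
    """
    i = bisect_left(sorted_tss, peak_pos)
    neighbours = sorted_tss[max(i - 1, 0): i + 1]
    return any(abs(peak_pos - tss) <= window for tss in neighbours)


# Hypothetical TSS annotation and peak positions.
tss_positions = [10_000, 50_000]
labels = {
    pos: ("proximal" if is_proximal(pos, tss_positions) else "distal")
    for pos in (12_500, 12_501, 30_000, 47_500)
}
```

Applying such a test to every peak and counting the proximal labels would yield a proportion comparable to the ~40% reported above.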


Context-specific transcription factor co-association 


We first examined the genome-wide co-association of all pairs of 
transcription factors by analysing the overlap between peaks of all 
pairs of factors”®. Although many general trends can be identified, this 
approach does not take into account the context-specificity of tran- 
scription factor binding (that is, the observation that factors bind 
together in distinct combinations at different genomic locations, 
and that the co-binding of one pair of transcription factors is often 
affected by the binding of another transcription factor; Supplemen- 
tary Information, section C.1). Therefore, we developed a framework 
focusing on the specific genomic regions bound by a particular tran- 
scription factor (the focus factor) and examined the co-association of 
all other factors (partner factors) within this context (Supplementary 
Fig. 2a). For each ~350-base-pair region in the focus-factor context, 
we extracted normalized binding signals of overlapping peaks of all 
transcription factors, generating a co-binding map. Figure 1a shows 
such a map for the GATA1 context. Here, factors that consistently co- 
associate with each other and a substantial proportion of GATA1 
peaks are termed ‘primary partners’ (for example, group 6 transcription 
factors such as GATA2 and TAL1 in Fig. 1a). In addition to these 
factors, there are also groups of ‘local partners’ that co-associate with 
each other in the presence of GATA1, but only at specific subsets of 
GATA1-binding peaks (for example, JUN in group 7 and MAX in 
group 3; Fig. 1a and Supplementary Fig. 2c-1). These ‘biclusters’, 
typically containing two to five transcription factors, can be mutually 
exclusive or partially overlapping. 
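
To make the idea of a co-binding map concrete, the sketch below builds a small matrix with one row per partner factor and one column per focus-factor region, filled with the signal of any overlapping partner peak. It is a simplified illustration on one chromosome. The function names, the choice of the maximum signal when several peaks overlap, and the toy signals are assumptions rather than the procedure in the Supplementary Information.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def cobinding_map(focus_regions, partner_peaks):
    """Build {factor: [signal per focus region]}.

    focus_regions: list of (start, end) for the focus factor.
    partner_peaks: dict mapping factor name to a list of (start, end, signal).
    Regions with no overlapping partner peak get 0.0.
    """
    matrix = {}
    for factor, peaks in partner_peaks.items():
        row = []
        for f_start, f_end in focus_regions:
            signals = [
                signal
                for p_start, p_end, signal in peaks
                if overlaps(f_start, f_end, p_start, p_end)
            ]
            # Assumption: keep the strongest overlapping peak.
            row.append(max(signals, default=0.0))
        matrix[factor] = row
    return matrix


# Two hypothetical ~350-bp focus regions and two partner factors.
focus = [(1000, 1350), (5000, 5350)]
partners = {
    "GATA2": [(1100, 1300, 0.8)],
    "JUN": [(5300, 5600, 0.4), (9000, 9300, 0.9)],
}
toy_map = cobinding_map(focus, partners)
```

In this toy map GATA2 co-occurs with the first focus region and JUN with the second, which is the kind of pattern that clustering of a full map would group into primary or local partners.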

To identify systematically all primary and local partners for each 
focus-factor context, we used a machine-learning approach. We 
derived nonlinear, combinatorial models of each focus-factor’s 
co-binding map relative to randomized control maps (Supplemen- 
tary Information, section C.2, and Supplementary Fig. 2a, b). Analysis 
of multivariate rules in these models, in turn, identified pairs and 
higher-order clusters of significantly co-associated transcription 
factors. Moreover, these co-associations are robust to peak overlap 
and calling thresholds (Supplementary Information, section C.4). 

The first statistic derived from the models is a relative importance 
(RI) score (Supplementary Information, section C.2.4.2), which gives 
the overall importance of each transcription factor in the model. It 
reflects the ‘size’ of the biclusters to which a particular transcription 
factor belongs, and it is related to the number of co-binding factors 
and the fraction of peak locations involved. For the GATA1 context 
(Fig. 1b and Supplementary Fig. 2c-2), primary partners TAL1, 
GATA2 and POL2, as well as local partners MAX and JUN, have high 
RI scores. To reveal further the partnering in the focus-factor context, 
we computed co-association scores between all pairs and higher-order 
sets of transcription factors (Supplementary Information, section 
C.2.4). These scores measure the impact of the co-dependency 
implicit in a particular pair on the model as a whole, and they more 
directly probe the co-occupancy of transcription factors in the focus- 
factor context than does the RI score. For the GATA1 context, the 
co-association scores revealed both expected and novel pairings (for 
example, MYC-MAX-E2F6 and CCNT2-HMGN3, respectively; 
Fig. 1b, Supplementary Fig. 2c-2 and Supplementary Information, 
section C.3.1.4). Furthermore, GATA1 is usually associated with 
enhancer activity. However, the co-association score shows that it is 
connected to both repressive (for example, NRSF (also called REST) 
and HDAC2) and activating factors (for example, P300). This 
discordant behaviour has been observed previously; here, it is borne 
out by expression studies and knockdowns (Supplementary 
Information, section C.3.1.4). In particular, after GATA1 knockdown, 
we found that 94 targets of GATA1 were significantly upregulated, 
and only 54 were downregulated (Supplementary Fig. 2e-4). 
Finally, we analysed the functions of genes that lie near clusters of 
co-associated factors, and found that many are enriched for specific 
biological functions (Supplementary Fig. 2e-2). For example, one 
bicluster involving E2F6 (E2F6-GATA1-GATA2-TAL1) was 
enriched for genes related to myeloid differentiation, whereas another 
(E2F6-SP1-SP2-FOS-IRF1) was involved in DNA damage response 
(Supplementary Information, section C.3.3). Thus, distinct combina- 
tions of factors regulate specific types of genes. 


Comparing co-association across contexts 

Aggregate RIM and PPM 

After establishing the co-binding structure in each transcription 
factor context, we compared our co-association statistics across con- 
texts. In particular, we combined the RI scores for each transcription 
factor into a single matrix (RIM, Supplementary Fig. 2a). Clustering 
reveals nine functionally distinct classes of transcription factor 
contexts that fall into four broad groups: proximal, distal, repressive 
and mixed (Fig. 1d, Supplementary Fig. 2f-1 and Supplementary 
Information, section C.3.4.1). Next, combining the co-association 
scores from all focus factors across different contexts provides an 
overall view of all the primary partners of each transcription factor 
in the form of a primary-partner matrix (PPM; Supplementary Fig. 2f-4). 
The RIM reflects the overall similarities in the binding context of 
focus factors, whereas the PPM highlights the specific factors that tend 
to co-bind with each other (mutual primary partners). To some degree, 
one can see the PPM as a subset of the relationships implicit in the RIM. 
That is, two factors can have similar binding contexts without explicit 
co-association; for example, two factors may both tend to bind 
promoters but near different sets of genes. Overall, the PPM shows 
well known sets of co-associated transcription factors, such as FOS-JUN 
(the AP1 complex) and CTCF-RAD21-SMC3 (the cohesin complex), as well 
as many novel co-associations, such as CHD2-ZBTB33, EGR1-ZBTB7A and 
CTCF-ZNF143-SIX5 (Supplementary Information, section C.3.6.2). We 
confirmed one novel co-association (CEBPB-TAL1) using 
co-immunoprecipitation and mass spectrometry (Supplementary Table 3a). 


Figure 1 | Transcription factor co-association. a, The co-binding map for the 
GATA1 focus-factor context in K562 cells shows the binding intensity of peaks 
of all transcription factors (TFs) in K562 (rows) that overlap each GATA1 peak 
(columns). The coloured rectangles represent eight key clusters consisting of 
different combinations of co-associating partner-factors. b, The GATA1 
context-specific relative importance (RI) scores of all partner factors (top) and 
the matrix of co-association scores (CS) between all pairs of factors (bottom). 
Primary and local partners of GATA1 have high RI scores. The co-association 
score matrix captures the eight clusters observed in a. c, Different partner factors 
are preferentially enriched at gene-distal (positive differential RI) and proximal 
(negative differential RI) GATA1 peaks. d, The aggregate factor importance 
matrix (RIM), obtained by stacking the RI scores of all partner factors (columns) 
from all focus-factor contexts (rows) in K562 cells, shows nine functionally 
distinct clusters (C1 to C9) of contexts that can be broadly grouped as distal, 
proximal, mixed and repressive. The blue rectangles highlight representative 
partner factors with high RI scores in the clusters. The arrow from b to 
d indicates that the GATA1 context-specific RI scores form one row in this 
matrix. e, Co-association variability map of partners (columns) of GATA1 (left 
panel) and FOS (right panel) over all K562 focus-factor contexts (rows). TAL1 
and GATA2 show consistently high co-association scores with GATA1 over 
most focus-factor contexts, but JUND shows context-specific co-association. 
FOS shows marked changes in co-association score of partner factors over 
different contexts (for example, FOS-JUND in distal contexts and FOS-SP2 in 
proximal ones). (More details are available in Supplementary Fig. 2c, d, f-1, f-2.) 


Variability map 

The variability map shows the degree of variability in the partners of 
a given transcription factor over contexts (as determined by the 
co-association score) (Supplementary Information, section C.2.5.5). 
For instance, Fig. 1e shows that GATA1 has mostly the same partners 
in many contexts (for example, TAL1 and GATA2 are partners over 
almost all contexts). However, a few partners (for example, JUND) 
are present in only some contexts. An extreme example is FOS, 
which completely changes its partners in different contexts (Fig. 1e, 
Supplementary Fig. 2f-2 and Supplementary Information, section 
C.3.6.1). 


Cell-type differences 

We analysed transcription factor co-association in the five main 
ENCODE cell types (Supplementary Information, section C.3.4). 
The GM12878 and K562 cell lines have the most common (31) tran- 
scription factor data sets (Supplementary Information, section C.3.5). 
Comparative analysis showed that over 80% of the transcription 
factor pairs had no significant change in co-association between 
K562 and GM12878 cell lines. However, there were a few marked 
examples of cell-line differences. For instance, FOS and JUND 
co-associate in K562 but not in GM12878 cells (Supplementary 
Information, section C.3.5.1), despite the fact that most of the other 
partners of FOS are maintained in both cell lines. 


Gene context: proximal versus distal 

Overall, we found distinct partner preferences at proximal and distal 
sites. These results were robust to the choice of the distance used to 
define proximal and distal regions (Supplementary Fig. 2c-3). In 
particular, for the GATA1 context, we found that RI scores change 
markedly between proximal and distal sites (Fig. 1c and Supplementary 
Fig. 2c-3): typical core promoter transcription factors (for example, 
POL2, E2F6, MAX and ELF1) have a significant proximal promoter 
bias, whereas JUND, JUNB, JUN and P300 show preferential 
co-association with distal sites. Another way of analysing differences 
between proximal and distal sites is in the framework of the variability 
map, in which one can observe the changing partners of a transcrip- 
tion factor in different contexts. For instance, FOS has completely 
different partners with which it co-associates proximally and distally 
(Fig. 1e, Supplementary Fig. 2f-2 and Supplementary Information, 
section C.3.6.1). 


Assembling pairwise interactions into hierarchies 
Analysis of co-associations specifies the relationships between the 
DNA-binding profiles of multiple regulators. To obtain a systems- 
level perspective, we recast transcription factor associations as a net- 
work (Supplementary Fig. 4a), wherein the nodes are regulators or 
their targets, and the edges designate regulatory relationships. Here, 
we focussed on the global wiring pattern across all cell types. We 
expected different subnetworks within this framework to be active 
to different degrees in different cells. 

Using our binding-site list, we identified an initial set of regulatory 
targets from genes having promoter-proximal binding sites. The 
resulting raw network consists of 500,542 promoter-associated 
interactions between transcription factors and all their putative targets, 
of which 4,809 are between pairs of factors (networks at 
http://encodenets.gersteinlab.org). We filtered this to identify the most 
confident interactions using a probabilistic model, giving 26,070 
total interactions, with only 338 between transcription factors 
(Supplementary Information, section D.1). We validated the perform- 
ance of the filtering using the siRNA experiments; for each case, the 
targets identified by our model were more differentially expressed in 
siRNA-treated cells than were those identified by a simple peak-based 
method (Supplementary Fig. 1c-e). 

We next computed common connectivity statistics for individual 
transcription factors, namely, out-degree (O), in-degree (I) and 
betweenness, which were then used to identify hubs and informa- 
tion-flow bottlenecks (Supplementary Information, section K). Of 
particular interest is the difference between out- and in-degree 
(O − I), which measures the direction of information flow (Supplementary 
Fig. 3a). A positive value suggests that a transcription 
factor is located ‘upstream’ in the network, whereas a negative value 
indicates that it is ‘downstream’. We further defined a normalized 
version of this ‘hierarchy height’ metric, h = (O − I)/(O + I). We 
found that this can be approximated by three levels (Supplementary 
Fig. 3c), with top-level, ‘executive’ transcription factors regulating 
many other factors (h ≈ 1), and bottom-level ‘foreman’ factors more 
regulated than regulating (h ≈ −1). For purposes of visualization, we 
used a simulated-annealing procedure to optimally and robustly 
arrange the 119 transcription factors into three discrete levels (with 
the number of downward-pointing edges maximized) (Fig. 2a and 
Supplementary Information, section D.2). 
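
The hierarchy height defined above can be computed directly from a list of regulator-to-target edges among transcription factors. The following is a minimal sketch, assuming a simple edge-list representation; the function name and the toy network are hypothetical, and the simulated-annealing step that assigns discrete levels is not reproduced here.

```python
from collections import Counter


def hierarchy_height(edges):
    """Compute h = (O - I) / (O + I) for every factor in an edge list.

    edges: iterable of (regulator, target) pairs among transcription factors.
    O is the out-degree and I the in-degree of a factor. Every factor that
    appears in an edge has O + I >= 1, so the division is always defined.
    A self-loop counts once towards O and once towards I.
    """
    out_deg, in_deg = Counter(), Counter()
    for regulator, target in edges:
        out_deg[regulator] += 1
        in_deg[target] += 1
    heights = {}
    for tf in set(out_deg) | set(in_deg):
        o, i = out_deg[tf], in_deg[tf]
        heights[tf] = (o - i) / (o + i)
    return heights


# Toy network with invented factor names.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
h = hierarchy_height(toy_edges)
```

In this toy network, factor A only regulates (h = 1) and factor D is only regulated (h = −1), corresponding to the ‘executive’ and ‘foreman’ extremes described in the text.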


Layering on distal, ncRNA and protein interactions 

The filtered transcription factor hierarchy consists of the strongest 
promoter-associated interactions. Building upon this skeleton, we 
added additional types of connections. 

Interactions involving distal regulatory elements (for example, 
enhancers) are more difficult to identify than those involving 
proximal elements. Here, we used a statistical model. This identifies 
distal sites with potentially many binding transcription factors using 
chromatin features. These regions were associated with a gene if their 
changing pattern of chromatin marks across cell lines correlates with 
the expression of that gene (Supplementary Information, section E.1). 
Overall, the model identified 19,258 distal edges (Fig. 2a). 
The regulatory interactions between transcription factors and 
ncRNAs constitute an additional layer of information to add to the 
meta-network. We used transcription factor peaks proximal to 
ncRNAs to identify transcription-factor-to-ncRNA regulation. Next, 
we incorporated miRNA-to-transcription-factor regulatory interactions 
from TargetScan (Supplementary Information, section E.2). 
Finally, we incorporated physical protein-protein interactions, as 
well as predicted phosphorylations (Supplementary Information, 
section F.3, and Supplementary Fig. 7a). Overall, these different 
interactions form a dense meta-network that we analysed further for 
interesting biological properties. 


Relating network connectivity and genomic properties 
We next correlated measures for the connectivity and hierarchical 
position of each transcription factor with a wide variety of genomic 
and proteomic properties (Fig. 2c, Table 1 and Supplementary Table 
4, P values in the latter). 
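
Table 1 reports Spearman correlations between network measures and genomic properties. As a reminder of what that statistic computes, the sketch below implements a rank correlation with average ranks for ties, using only the standard library. The function names and the toy values are assumptions for illustration; they are not data from this study, and no P value is computed.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: out-degree and number of ncRNA targets for five factors.
out_degree = [5, 40, 12, 90, 33]
ncrna_targets = [1, 9, 4, 7, 30]
rho = spearman(out_degree, ncrna_targets)
```

Because the statistic depends only on ranks, it is insensitive to the heavy-tailed degree distributions typical of regulatory networks.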


Correlations with distal edges 

Distal edges have a different degree distribution than do proximal 
ones (Fig. 2a and Supplementary Fig. 5). Inspection reveals that many 
point upward in the transcription factor hierarchy, opposite to most 
proximal edges. Furthermore, we found many transcription factors 
with low in-degree values in the proximal network but high 
in-degree values in the distal one, indicating that they are heavily 
regulated through enhancers (Supplementary Fig. 5a). Some of these 
are well known condition- and tissue-specific regulators (for example, 
IRF4 and GATA1). 


Correlations within the proximal network 

Upper-level transcription factors tend to have more targets than 
lower-level ones, both overall and when considering only other tran- 
scription factors as targets. As measured by betweenness in proximal 
regulation, middle-level transcription factors form information-flow 
bottlenecks (Fig. 2c). Moreover, betweenness in the proximal network 
is correlated with more distal regulation. This tends to increase the 
information flow through mid-level bottlenecks even more. (See 
Supplementary Information section F.3.6 for clarification of the 
implications.) 


Correlation with protein interactions and the phosphorylome 
We found that top-level transcription factors tend to have more 
partners in the protein-interaction network than do lower-level ones 
(Fig. 2c and Table 1). We further studied how transcription factors 
in different levels are regulated by kinases. Although there is no 
significant difference in terms of the number of kinases regulating 
transcription factors at different levels, we found that if the 
phosphorylome is arranged into a hierarchy using the same approach 
used for organizing the transcription factor network, kinases at 
the bottom tend not to phosphorylate transcription factors, but they 
tend to be regulated by them (particularly by top-level factors; 
Supplementary Fig. 7). 


Correlation with ncRNAs 

We found that top- and middle-level transcription factors have the 
highest total number of ncRNA targets (Fig. 2c, Table 1 and Sup- 
plementary Fig. 6a), consistent with our findings for protein-coding 
targets. We then developed a score indicating the fraction of a tran- 
scription factor’s total regulation devoted to ncRNAs, relative to 
protein-coding genes (Supplementary Information, section E.2); this 
identified several factors that preferentially target ncRNAs, such as 
BDP1 and BRF2 (Supplementary Fig. 6b, c). 

Matching the pattern for ncRNAs in general, most of the transcrip- 
tion factors involved in miRNA regulation tend to be top- or middle- 
level ones (Fig. 2c). Moreover, highly connected transcription factors 
tend to regulate more miRNAs and to be more regulated by them 
Figure 2 | Overall network. a, Close-up representation of the transcription 
factor hierarchy. Nodes depict transcription factors. TFSSs are triangles, and 
non-TFSSs are circles. Left: proximal-edge hierarchy with downward pointing 
edges coloured in green and upward pointing ones coloured in red. The nodes 
are shaded according to their out-degree in the full network (as described in 
Table 1). Right: factors placed in the same proximal hierarchy but now with 
edges corresponding to distal regulation coloured green and red, and nodes 
re-coloured according to out-degree in the distal network. The distal edges do not 
follow the proximal-edge hierarchy. b, Close-up view of transcription-factor-miRNA 
regulation. The outer circle contains the 119 transcription factors, 
whereas the inner circle contains miRNAs. Red edges correspond to miRNAs 
regulating transcription factors; green edges show transcription factors 
regulating miRNAs. Transcription factors and miRNAs each are arranged by 
their out-degree, beginning at the top (12:00) and decreasing in order clockwise. 
Node sizes are proportional to out-degree. For transcription factors, the 
out-degree is as described in Table 1; for miRNAs, it is according to the 
out-degree in this network. Red nodes are enriched for miRNA-transcription factor 
edges and green nodes are enriched for transcription factor-miRNA edges. 
Grey nodes have a balanced number of edges (within ±1). c, Average values of 
various properties (topological, dynamic, expression-related and 
selection-related, ordered consistently with Table 1) for each level are shown for the 
proximal-edge hierarchy. The top, middle and bottom rows correspond to the 
top, middle and bottom of the hierarchy, respectively. The sizing of the grey 
circles indicates the relative ordering of the values for the three levels. 
Significantly different values (P < 0.05) using the Wilcoxon rank-sum test are 
indicated by black brackets. The proximal-edge hierarchy depicted on the right 
shows non-synonymous SNP (ns-SNP) density, where the shading 
corresponds to the density for the associated factor. (See Supplementary Fig. 4 
for more details.) 


Table 1 | Correlating properties with centrality and hierarchy height 


Correlation with: 

Category                          Property                        Degree centrality‡      Betweenness centrality  (O − I)/(O + I) 
                                                                  Full        TF-TF       Full        TF-TF       TF-TF 
Topology                          Number of TF partners in PPI    0.28†       0.27†       0.25*       0.33†       0.08 
Topology                          Number of miRNA regulators      0.24*       0.33†       −0.02       0.00        0.29% 
Topology                          Number of ncRNA targets         0.65†       0.49†       0.34†       0.35†       0.22* 
Topology                          Number of miRNA targets         0.62†       0.50†       0.33†       0.34†       0.19* 
Topology                          Number of distal targets        0.32†       0.24*       0.19*       0.23*       0.07 
Dynamics                          Amount of rewiring              −0.14       −0.12       0.44*       0.35        −0.42* 
Expression                        Expression level                0.14        0.12        0.23*       0.27*       −0.04 
Expression                        Binding-expression correlation  0.41†       0.31†       0.30†       0.36†       0.19* 
Selection properties for factors  ns-SNP density                  −0.19*      −0.27*      −0.01       −0.03       −0.22 
Selection properties for factors  Allelicity                      0.20        0.28*       −0.10       −0.16       0.18 
Selection properties for targets  ns-SNP density                  −0.05†      -           -           -           - 
Selection properties for targets  dN/dS                           −0.05†      -           -           -           - 


Spearman correlation values of various properties (topological, dynamic, expression-related and selection-related) with centrality measures and hierarchy height. Only properties that are significantly correlated 
with centrality or hierarchy height are listed. For a full set of properties, P values and explanations, see Supplementary Tables 4 and 6. dN/dS, non-synonymous to synonymous mutation ratio. 

* Spearman correlation P < 0.05. 
† Spearman correlation P < 0.01. 
‡ Degree centrality refers to out-degree, except for selection properties on targets, in which case it refers to in-degree. In particular, out-degree in the full transcription factor target network refers to the ‘Targets’ column in Supplementary Table 4a, and the same quantity is used throughout Fig. 2. 


6 SEPTEMBER 2012 | VOL 489 | NATURE | 95 


©2012 Macmillan Publishers Limited. All rights reserved 




(Table 1 and Fig. 2b). However, when we analyse transcription- 
factor-miRNA regulation in detail we find that the factors most 
involved in miRNA regulation tend to either largely regulate or be 
regulated by miRNAs (Fig. 2b and Supplementary Fig. 4d). That is, 
there are few high-degree transcription factors with ‘balanced regu- 
lation’ (similar numbers of incoming and outgoing edges, relative to a 
control; Supplementary Fig. 3m). The same pattern can be seen for 
miRNAs (Supplementary Fig. 3l). 


Correlation with families and functional categories 
Chromatin-related factors are enriched at the top of the hierarchy, 
whereas TFSSs are enriched in the middle (Supplementary Table 5a 
and Supplementary Information, section F.1). Also, TFSSs show a 
greater degree of tissue specificity and are more highly regulated by 
miRNAs than are general and chromatin-related factors (Supplemen- 
tary Information, section F.4), indicating that they may be more finely 
tuned in their expression. Examining functional enrichment, we 
found that transcription factors at the top of the hierarchy tend to 
have more general functions, and those at the bottom tend to have 
more specific functions (Supplementary Table 5c and Supplementary 
Information, section F.1). 


Correlation with network dynamics 

We studied how transcription factors change their binding patterns 
among different cell types, principally between the K562 and 
GM12878 cell lines. We quantified the amount of ‘rewiring’ as the 
fraction of unshared targets, normalized by the union of two target 
sets (Supplementary Information, section C.3.5). We found that this 
‘rewiring score’ is negatively correlated with hierarchy height 
(Fig. 2c and Table 1). This means that the targets of lower-level tran- 
scription factors tend to change more between cell types, consistent 
with their role in more specialized processes. 
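
Read literally, the rewiring measure described above is the number of targets found in only one of two cell types divided by the size of the union of the two target sets. The sketch below follows that reading; the function name and gene labels are hypothetical, and any further normalization used in the Supplementary Information is not reproduced.

```python
def rewiring_score(targets_a, targets_b):
    """Fraction of unshared targets, normalized by the union of both target sets.

    Returns 0.0 when both sets are empty (an assumption made here so that
    the score is always defined).
    """
    a, b = set(targets_a), set(targets_b)
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


# Invented target sets for one factor in two cell lines.
targets_k562 = {"G1", "G2", "G3"}
targets_gm12878 = {"G2", "G3", "G4", "G5"}
score = rewiring_score(targets_k562, targets_gm12878)
```

A score of 0 means the factor keeps exactly the same targets in both cell types, and a score of 1 means the two target sets are disjoint.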


Correlation with gene expression 

We calculated the average expression levels of transcription factors 
across 34 tissues; highly connected factors tend to be highly 
expressed. We further examined the relationship between connectivity 
and expression by calculating, for each transcription factor, the cor- 
relation between its binding signal around its targets and the level of 
target expression (Supplementary Information, section F.3.4). This 
binding—expression correlation is positively correlated with factor con- 
nectivity. Moreover, transcription factors at the top and middle levels 
show a greater correlation. Thus, more ‘influential’ transcription 
factors tend to be better connected and higher in the hierarchy. 
(This degree of “influence” becomes even clearer when one considers 
weighting the correlation by the number of transcription factor targets, 
given that higher-level factors tend to have more targets.) However, 
somewhat surprisingly, a model integrating the binding-expression 
relationships of all the highly connected transcription factors has about 
the same predictive power for expression as a model integrating all the 
less connected ones, indicating that the weak binding-expression rela- 
tionships of the less influential factors are collectively quite influential 
(Supplementary Information, section F.3.4). 


Collaboration between hierarchy levels 

We explored how transcription factors in the top, middle and bottom 
(T, M and B, respectively) levels of the hierarchy collaborate, in terms 
of both inter-level (TM, MB, TB) and intra-level (TT, MM, BB) rela- 
tionships (Fig. 3a). We examined three kinds of collaboration: 
co-association (as described earlier), physical interactions, and 
target-expression cooperativity. We defined two transcription factors 
as being cooperative if their shared targets are significantly different in 
expression from their unshared targets (Supplementary Information, 
section G.2). Overall, we found that collaborations involving the 
middle level (and to a lesser extent, the top one) tended to be enriched. 
In particular, TM and MM transcription factor pairs influenced gene 
expression cooperatively. Next, all co-associations involving top- and 
middle-level factors are enriched, whereas those involving the bottom 
level are depleted. A similar pattern was observed for protein-protein 
interactions, with TT and TM co-regulation more likely to occur 
between physically interacting transcription factors (Fig. 3a and 
Supplementary Information, section G.1). 


Figure 3 | Collaboration between levels. a, Enrichment of collaborating 
transcription factor pairs from different levels (top (T), middle (M) and bottom 
(B)). The factors are represented by two nodes below each bar graph. The 
dashed orange line indicates the expected level of collaboration. Significant 
enrichment above or depletion below that level is marked by asterisks 
(P < 0.05). (See Supplementary Information section G.1.2 for more details.) 
b, Enrichment of proximal and distal co-regulatory pairs in the network 
hierarchy. Co-regulatory pairs from different levels are shown by the two nodes 
below each bar. 


Finally, we analysed how proximal and distal sites ‘collaborate’. We 
identified pairs of transcription factors that bind to the promoter and 
distal regulatory regions of the same target gene (Supplementary 
Information, section G.3) and studied their respective locations in 
the factor hierarchy. We found an asymmetry between proximal 
and distal regulation, with transcription factors associated through 
promoter regulation more likely to reside in upper levels (Fig. 3b). 


Enriched network motifs 


Apart from its global structure, we further studied the network from 
the perspective of its constituent building blocks; that is, network 
motifs, which are small connectivity patterns that carry out canonical 
functions. We systematically searched for motifs, first in the 
promoter-regulation hierarchy and then in the meta-network includ- 
ing distal, miRNA and protein-protein interactions. Our procedure 
was to instantiate all possible motifs for broad template patterns 
and then determine which of these were significantly over- or 
under-represented relative to a random control (Supplementary 
Information, section H). For instance, starting with all possible 
three-transcription-factor motifs in the proximal network (Fig. 4a), 
we found the most enriched motif to be the well-studied feed-forward 
loop (FFL). In agreement with the observed collaborations within 
the hierarchy, many FFLs involve the middle level (Supplementary 
Fig. 9a). Moreover, by analysing the expression levels of the con- 
stituent genes of the FFLs over many tissues, we found that many 
were positively correlated, highlighting the tight regulation implicit in 
the motif (Fig. 4a and Supplementary Information, section H.1). 
Figure 4 | Motif analysis. Motifs are accompanied by the number of 
occurrences, n. Enriched motifs are highlighted in green; depleted ones in red. 
An asterisk means that the corresponding enrichment/depletion is statistically 
significant (P= 1 X 10 °). The motifs are sorted such that those at the ends 
have more significant P values. (See Supplementary Fig. 9h for more details.) 
a, Systematic search of three-transcription-factor motifs. The most enriched 
motif is the FFL. A particular example formed by STAT1, STAT3 and RUNX1 is 
highlighted. Here, the ‘+’ symbol on an edge indicates that the correlation 
between the gene expression of the source and the target across tissues is 
positive. Other motifs containing a toggle-switch regulation on top of the basic 
FFL design are also indicated. b, Proximal—distal PPI MIMs. Here we searched 
all motifs involving the co-regulation of two transcription factors (which could 


Finally, we found further enriched three-transcription-factor motifs 
containing an additional regulation on top of that in a FFL. This 
creates a mutual regulation between a pair of transcription factors, 
instantiating a toggle-switch, which has been shown to have an essen- 
tial role in the determination of cell fate*’. 
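
To make the motif definitions concrete, the sketch below enumerates feed-forward loops (X regulates Y, Y regulates Z, and X also regulates Z) and auto-regulators (self-loops) in a directed edge list. It only counts occurrences: the comparison against a random control, which is what establishes enrichment or depletion, is not reproduced. The function names and the toy network are hypothetical.

```python
from collections import defaultdict


def feed_forward_loops(edges):
    """Return sorted (x, y, z) triples with x->y, y->z and x->z, all distinct."""
    successors = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # self-loops are handled separately below
            successors[src].add(dst)
    loops = []
    for x in list(successors):
        for y in successors[x]:
            for z in successors.get(y, ()):
                if z != x and z in successors[x]:
                    loops.append((x, y, z))
    return sorted(loops)


def auto_regulators(edges):
    """Return the set of nodes that regulate themselves."""
    return {src for src, dst in edges if src == dst}


# Toy network with invented node names.
toy_edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W"), ("W", "W")]
ffls = feed_forward_loops(toy_edges)
self_regulating = auto_regulators(toy_edges)
```

If a mutual regulation is added on top of a feed-forward loop (for example, Y also regulating X), the same triple is found in both orientations, which corresponds to the toggle-switch variants described above.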

Next, we analysed another template: all possible multiple-input 
modules (MIMs, defined in Supplementary Information, section K) 
involving promoter and distal regulation and a protein-protein 
interaction (proximal-distal PPI MIMs, Fig. 4b). We found that 
co-regulating transcription factors are likely to interact physically, 
indicating that they work together as a complex. Moreover, the motif 
ranking second in enrichment consists of a distal regulatory relation- 
ship, a promoter regulatory relationship, and a protein-protein inter- 
action. This is suggestive of a common picture of DNA looping, with 
an interacting complex of transcription factors binding to the pro- 
moter and enhancer simultaneously. 

The connection between co-regulated entities extends to miRNA 
regulation. We surveyed all possible instances of a miRNA regulating 
two transcription factors (miRNA SIM, Fig. 4c) and found that the 
miRNAs are more likely to regulate a pair of physically interacting 
factors. This enrichment indicates that, to avoid unwanted cross-talk, 
a miRNA tends to shut down an entire functional unit (that is, tran- 
scription factor complex) rather than just a single component. 
Similarly, we found that miRNAs tend to target a pair of transcription 
factors binding both proximally and distally (Fig. 4c). This suggests 
that miRNA represses the expression of both promoter and distal 
regulators to shut down a target completely. Apart from miRNAs, 
we also studied motifs involving other kinds of ncRNAs. Among 
motifs involving a transcription factor regulating two ncRNAs, there 
is great enrichment for both ncRNAs to be long intergenic non- 
coding RNAs (lincRNAs) (Supplementary Information, section H.2). 

Finally, we found the network to be enriched for auto-regulators 
(28 out of 119 transcription factors), a simple but important motif, 
be either proximal or distal) with (or without) a protein-protein interaction 
between them. Motifs containing the protein-protein interaction tended to be 
enriched. c, miRNA SIMs. The two enriched motifs resulting from enumerating 
all motifs in which a miRNA targets two transcription factors that are connected 
in various ways are shown. These two motifs contain a protein complex of two 
transcription factors and a cooperative pair of promoter and distal regulatory 
transcription factors. d, The auto-regulator motif is enriched in the 
transcription factor-transcription factor network: 28 of all factors are auto- 
regulators. Moreover, auto-regulators are more likely to be repressors (—) 
relative to non-auto regulators, and they tend to have more ncRNAs as their 
targets. In the box plots, the red line indicates the median, the blue box shows 
the interquartile range (IQR), and whiskers extend out to 1.5 IQR. 


which are commonly found in networks exhibiting multistability. 
Moreover, we found that the auto-regulators tend to be repressors, 
representing a well known design principle for maintaining steady 
state (Fig. 4d). 


Allelic behaviour in a network framework 
We examined the relationship between sequence variation and 
transcription factor regulation. In particular, we investigated the 
coordination between allele-specific binding and allele-specific 
expression. We used the sequenced data sets for the GM12878 cell line, 
which has a deeply sequenced diploid genome (Supplementary 
Information, section I.1). We extended pairwise analysis of 
allele-specific behaviour to study higher-order coordination of multiple factors 
regulating a common target. We first generated the unfiltered, pro- 
moter-regulation network for GM12878 cells and then identified a sub- 
network within it representing the difference between maternal- and 
paternal-specific networks (Supplementary Information, section I.2). 
This subnetwork is shown in Fig. 5a, with 4,798 transcription-factor- 
target edges coloured red or blue to represent predominantly maternally 
or paternally regulated targets; the targets are similarly coloured to 
indicate predominantly maternal or paternal expression. We found that 
of the 4,798 allele-specific binding cases of a single factor regulating its 
associated target, 57% showed coordinated allelic binding and expres- 
sion. We then found that for the cases in which two transcription factors 
regulate a common target, 63% were consistent (that is, both factors 
bind to the same allele that is expressed). For those cases in which triplets 
of transcription factors regulate a common target, the consistency 
increased to 65%. This trend continues, demonstrating that, as one 
increases the degree of combinatorial regulation, there is a progressively 
stronger relationship between expressed and regulated alleles. 
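
The trend above can be illustrated with a small tally of how often allele-specific binding agrees with allele-specific expression as the number of regulating factors grows. The sketch below assumes one possible reading: a target counts as coordinated when every factor bound allele-specifically at it binds the allele that is predominantly expressed, and targets are grouped by their number of such factors. The data layout, function name and toy records are hypothetical, and the study's exact counting scheme may differ.

```python
from collections import defaultdict


def coordination_by_regulator_count(targets):
    """Return {k: fraction of targets with k factors that are coordinated}.

    targets: {gene: (expressed_allele, [bound_allele for each factor])},
    with alleles labelled "M" (maternal) or "P" (paternal).
    """
    tally = defaultdict(lambda: [0, 0])  # k -> [coordinated, total]
    for expressed, bound in targets.values():
        k = len(bound)
        if k == 0:
            continue  # no allele-specific binding recorded for this target
        tally[k][1] += 1
        if all(allele == expressed for allele in bound):
            tally[k][0] += 1
    return {k: hits / total for k, (hits, total) in sorted(tally.items())}


# Invented records, not data from GM12878.
toy_targets = {
    "g1": ("M", ["M"]),
    "g2": ("P", ["M"]),
    "g3": ("M", ["M", "M"]),
    "g4": ("P", ["P", "M"]),
    "g5": ("P", ["P", "P"]),
}
fractions = coordination_by_regulator_count(toy_targets)
```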

The degree of allele-specific behaviour of each transcription factor 
can be quantified by a statistic that we call ‘allelicity’. The allelicity of a 
transcription factor is defined as the fraction of single nucleotide 
Figure 5 | Allelic effects. a, An ‘allelic effects network’ depicting the increasing 
coordination between allele-specific binding and allele-specific expression as 
the number of factors regulating a target increases. Central white nodes denote 
transcription factors, and peripheral nodes denote targets, which are blue (red) 
if they are expressed from the paternal (maternal) allele. Blue (red) edges denote 
allele-specific binding to the paternal (maternal) allele. This network represents 
the strongest differences between the paternal- and maternal-specific 
regulatory networks. As one goes around the larger circle anticlockwise 
(clockwise), each of the small circular clusters represents targets with 
progressively more paternal (maternal) regulation, indicated by the small blue 
(red) numbers to the side of the clusters. Moreover, within each of the clusters 
the fraction of predominantly paternally (maternally) expressed targets 
increases as one goes around the larger circle. As an illustration, this fraction is 
explicitly indicated by the ratios within three of the larger clusters at the bottom 
right. b, Relationship between transcription factor allelicity and selection. The 
bar height is the ratio of the degree of selection (as measured by SNP density or 
average DAF) in those binding peaks showing allelic behaviour to the degree of 
selection in all other binding peaks. Asterisks represent significant differences 
(P < 0.05, Wilcoxon rank-sum test). (See Supplementary Information section
I.2 and Supplementary Fig. 10b, c for details.)


polymorphisms (SNPs) that exhibit allele-specific binding out of all 
the SNPs that may potentially exhibit it (Supplementary Information, 
section I.3). Thus, qualitatively, allelicity may be thought of as the 
sensitivity of a transcription factor’s binding to maternal-versus- 
paternal variants. Using our network described here, we find that 
transcription factors with higher degrees of allelicity tend to have 
more target genes, indicating that these factors tend to vary more in 
their binding with sequence (Table 1). Finally, we found that
small insertions and deletions (indels) tended to cause dispropor- 
tionally more of these allelic events than did SNPs (Supplementary 
Table 6g). 
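
As a hedged illustration of the allelicity definition given above, the following sketch computes the fraction of testable SNPs that show allele-specific binding. The input representation (pairs of booleans per SNP) and the function name are assumptions; how a SNP is judged testable or allele-specific is described in Supplementary Information, section I.3, and is not reproduced here.

```python
def allelicity(snps):
    """Allelicity of one transcription factor.

    snps: iterable of (is_testable, is_allele_specific) booleans, one pair
    per SNP in the factor's binding peaks (hypothetical input layout).
    Returns the fraction of testable SNPs that are allele-specific,
    or NaN when no SNP is testable.
    """
    testable = [specific for can_test, specific in snps if can_test]
    if not testable:
        return float("nan")
    return sum(testable) / len(testable)
```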


Selection in a network context 


Previous studies have examined the relationship between evolutionary 
selection and position in the human protein-protein interaction
network. However, the analogous relationship in the regulatory network
has not yet been explored. 


Selection 

To address this, we first analysed the selective pressure on both
transcription factors and their targets. We predominantly used
non-synonymous SNP density from the 1000 Genomes Pilot to determine
selection among modern-day humans (Supplementary Information, 
section J). We also verified our results using other measures of 
selection (that is, derived allele frequency (DAF) and the ratio of 
non-synonymous to synonymous SNP rates (pN/pS statistic) (Sup- 
plementary Information, section J)). For selection over longer time- 
scales, we calculated the ratio of non-synonymous to synonymous 
substitution rates in human-chimp orthologue alignments (dN/dS). 
We found significant negative correlation between the regulatory 
in-degree of target genes and both their non-synonymous SNP density 
and dN/dS values (Table 1 and Supplementary Table 6e). Thus, target 
genes regulated by more transcription factors are under stronger nega- 
tive selection. Similarly, we found that there is a significant negative 
correlation between transcription factor regulatory out-degree and 
non-synonymous SNP density (Table 1 and Supplementary Table 
6d). We observed a consistent result with transcription factor dN/dS 
values and other measures of selection, although these are not all as 
statistically significant (Supplementary Table 6d and Supplementary 
Information, section J). This shows that transcription factors regulat- 
ing more targets tend to be under stronger negative selection. 
Moreover, within the transcription factor hierarchy, we found that 
factors at the top are under significantly stronger negative selection 
(Fig. 2c, Table 1 and Supplementary Table 6b). 

Consistent with all of these results relating connectivity with
constraint, we found that genes tolerant of loss-of-function mutations,
which are under weaker negative selection, have a significantly lower 
total degree (I + O) than other genes (Supplementary Information, 
section J). 


Selection and allelic effects 

Finally, we attempted to relate selection and allelic effects. We 
extracted transcription-factor-binding peaks in promoters and gene 
bodies showing allele-specific binding, and compared the selective 
pressure in these against a control (binding peaks within the same 
regions without allele-specific binding). We found that transcription- 
factor-binding peaks exhibiting allelic effects have higher SNP 
densities relative to the control (Fig. 5b). Moreover, binding peaks 
with no allelic effects show a skew in the DAF spectrum towards rarer 
SNPs, relative to allele-specific binding ones (Fig. 5b and Supplemen- 
tary Fig. 10c). The same trend holds true for indels and structural 
variants (Fig. 5b and Supplementary Fig. 10b, c). Interestingly, these 
results indicate that allelic regulation seems to be under less selective 
constraint. 


Discussion 

This study provides the first detailed analysis of how human regula- 
tory information is organized. A number of clear design principles 
emerge from it. Many of these are shared with model organisms 
(Supplementary Table 7), demonstrating that they are general 
features of transcription factor regulation. First, we found that the 
connectivity and hierarchical organization of regulatory factors is 
reflected in many genomic properties. For instance, top-level 
transcription factors have their binding more strongly correlated with
the expression of their targets, perhaps indicating that they are more
influential, as reported for model organisms. Second, the middle level
contains information-flow bottlenecks and much connectivity with 
miRNA and distal regulation. Targeting these bottlenecks (for 
example, by drugs) is likely to most strongly affect the flow of 
information through regulatory circuits. To some degree, the cell 
mitigates the effect of bottlenecks by having pairs of middle-level 
transcription factors collaborate in regulation. (Co-regulation 
mitigates bottlenecks.) Third, the regulatory network seems to be built 
from repeated reuse of small, modular motifs. In particular, regulation 
between levels involves many feed-forward loops, which could be 
used to filter fluctuations in input stimuli. Again, these properties 
are shared with model organisms; the network motifs and cooperating
middle level have been observed in yeast.

By contrast, the differences in proximal and distal regulation seem 
to be a unique feature of human regulation. This finding is evident 
in the analysis of both transcription factor co-association and 
network structure. The proximal-distal differences reflect the much 
larger intergenic space in humans than model organisms and the 
commensurately larger amount of distal binding. Finally, analysis of 
conservation indicates that more highly connected parts of the net- 
work are under stronger selection, consistent with results from model 
organisms. However, one unique finding for humans is ‘allelic’ effects. 
More highly connected transcription factors are more likely to exhibit 
allele-specific binding. Interestingly, we found that the actual allele- 
specific binding sites tend to be under less selection. Unravelling this 
interaction between selection and regulatory networks will be crucial 
to interpreting variants in the many personal genome sequences 
expected in the future. Co-published ENCODE-related papers can 
be explored online via the Nature ENCODE explorer
(http://www.nature.com/ENCODE), a specially designed visualization tool that
allows users to access the linked papers and investigate topics that 
are discussed in multiple papers via thematically organized threads. 


METHODS SUMMARY 


Detailed methods associated with each section of the paper are in a similarly titled 
section of the Supplementary Information. In particular, an overview of our data 
processing pipeline is in Supplementary Information, section B. 
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular
compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their 
characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the 
genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on 
its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for 
understanding genome function. Here we report evidence that three-quarters of the human genome is capable of 
being transcribed, as well as observations about the range and levels of expression, localization, processing fates, 
regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated
RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.


As the technologies for RNA profiling and for cell-type isolation and
culture continue to improve, the catalogue of RNA types has grown and
led to an increased appreciation for the numerous biological functions
carried out by RNA, arguably putting them on par with the functional
importance of proteins. The Encyclopedia of DNA Elements (ENCODE) project
has sought to catalogue the repertoire of RNAs produced by human
cells as part of the intended goal of identifying and characterizing the
functional elements present in the human genome sequence. The
five-year pilot phase of the ENCODE project examined approximately
1% of the human genome and observed that the gene-rich
and gene-poor regions were pervasively transcribed, confirming
results of previous studies. During the second phase of the
ENCODE project, lasting 5 years, the scope of examination was
broadened to interrogate the complete human genome. Thus, we have
sought to both provide a genome-wide catalogue of human transcripts
and to identify the subcellular localization for the RNAs produced.
Here we report identification and characterization of annotated and
novel RNAs that are enriched in either of the two major cellular
subcompartments (nucleus and cytosol) for all 15 cell lines studied,
and in three additional subnuclear compartments in one cell line. In
addition, we have sought to determine whether identified transcripts
are modified at their 5’ and 3’ termini by the presence of a 7-methyl
guanosine cap or polyadenylation, respectively. We further studied
primary transcript and processed product relationships for a large
proportion of the previously annotated long and small RNAs. These
results considerably extend the current genome-wide annotated catalogue of
long polyadenylated and small RNAs collected by the GENCODE
annotation group. Taken together, our genome-wide compilation
of subcellular localized and product-precursor-related RNAs serves as
a public resource and reveals new and detailed facets of the RNA
landscape.

• Cumulatively, we observed a total of 62.1% and 74.7% of the human
genome to be covered by either processed or primary transcripts,
respectively, with no cell line showing more than 56.7% of the union
of the expressed transcriptomes across all cell lines. The consequent
reduction in the length of ‘intergenic regions’ leads to a significant
overlapping of neighbouring gene regions and prompts a redefinition
of a gene.

• Isoform expression by a gene does not follow a minimalistic
expression strategy, resulting in a tendency for genes to express many
isoforms simultaneously, with a plateau at about 10-12 expressed
isoforms per gene per cell line.

• Cell-type-specific enhancers are promoters that are differentiable
from other regulatory regions by the presence of novel RNA
transcripts, chromatin marks and DNase I hypersensitive sites.

• Coding and non-coding transcripts are predominantly localized in
the cytosol and nucleus, respectively, with a range of expression
spanning six orders of magnitude for polyadenylated RNAs, and five
orders of magnitude for non-polyadenylated RNAs.

• Approximately 6% of all annotated coding and non-coding
transcripts overlap with small RNAs and are probably precursors to these
small RNAs. The subcellular localization of both annotated and
unannotated short RNAs is highly specific.


RNA data set generation 


We performed subcellular compartment fractionation (whole cell, 
nucleus and cytosol) before RNA isolation in 15 cell lines (Supplemen- 
tary Table 1) to interrogate deeply the human transcriptome. For the 
K562 cell line, we also performed additional nuclear subfractionation 
into chromatin, nucleoplasm and nucleoli. The RNAs from each of 
these subcompartments were prepared in replicate and were separated
based on length into >200 nucleotides (long) and <200 nucleotides 
(short). Long RNAs were further fractionated into polyadenylated and 
non-polyadenylated transcripts. A number of complementary tech- 
nologies were used to characterize these RNA fractions as to their 
sequence (RNA-seq), sites of initiation of transcription (cap-analysis 
of gene expression (CAGE)) and sites of 5’ and 3’ transcript termini
(paired end tags (PET); Supplementary Fig. 1). Sequence reads were
mapped and post-processed using a variety of software tools (Sup- 
plementary Table 2 and Supplementary Fig. 2). We used the mapped 
data to assemble and quantify de novo elements (exons, transcripts, 
genes, contigs, splice junctions and transcription start sites (TSSs)) as 
well as to quantify annotated GENCODE (v7) elements. Elements 
and quantifications were further assessed for reproducibility between 
replicates using a non-parametric version (npIDR, Supplementary 
Information) of the irreproducible discovery rate (IDR) statistical
test. Only elements deemed to be reproducible with at least 90%
likelihood were used in most analyses. The raw data, mapped data 
and elements were then made available by the ENCODE Data 
Coordination Center (DCC, http://genome.ucsc.edu/ENCODE/dataSummary.html)
(Supplementary Fig. 2). These data, as well as
additional data on all intermediate processing steps, are available on 
the RNA Dashboard (http://genome.crg.cat/encode_RNA_dashboard/). 


Long RNA expression landscape 
Detection of annotated and novel transcripts 
The GENCODE gene (Supplementary Fig. 3a) and transcript 
(Supplementary Fig. 3b) reference annotation captures our current
understanding of the polyadenylated human transcriptome. In the 
samples interrogated here, we cumulatively detected 70% of anno- 
tated splice junctions, transcripts and genes (Fig. 1 and Table 1a). We 
also detected approximately 85% of annotated exons with an average 
coverage by RNA-seq contigs of 96%. The variation in the proportion 
of detected elements among cell lines was small (Fig. 1, width of box 
plots). Consistent with earlier studies, most annotated elements are 
present in both polyadenylated (Supplementary Table 3a) and
non-polyadenylated (Supplementary Table 3b) samples. Only a small
proportion of GENCODE elements (0.4% of exons, 2.8% of splice 
sites, 3.3% of transcripts and 4.7% of genes) are detected exclusively 
in the non-polyadenylated RNA fraction. 

Beyond the GENCODE annotated elements, we observed a 
substantial number of novel elements represented by reproducible 
Figure 1 | A large majority of GENCODE elements are detected by RNA-seq
data. Shown are GENCODE-detected elements in the polyadenylated and 
non-polyadenylated fractions of cellular compartments (cumulative counts for 
both RNA fractions and compartments refer to elements present in any of the 
fractions or compartments). Each box plot is generated from values across all 
cell lines, thus capturing the dispersion across cell lines. The largest point shows 
the cumulative value over all cell lines. 


RNA-seq contigs. These novel elements covered 78% of the intronic 
nucleotides and 34% of the intergenic sequences (Supplementary Fig. 4). 
Overall, the unique contribution of each cell line to the coverage of the 
genome tends to be small and similar for each cell line (Supplementary 
Fig. 5). We used the Cufflinks algorithm (see Supplementary Informa- 
tion), and predicted over all long RNA-seq samples 94,800 exons, 69,052 
splice junctions, 73,325 transcripts and 41,204 genes in intergenic and 
antisense regions (Table 1b). These novel elements increase the 
GENCODE collection of exons, splice sites, transcripts and genes by 
19%, 22%, 45% and 80%, respectively. The increase in the number of 
genes and the relatively low contribution of novel splice sites is primarily 
caused by the detection of both polyadenylated and non-polyadenylated 
mono-exonic transcripts (Supplementary Table 3). Detection of 
unspliced transcripts could partially be an artefact caused by low levels 
of DNA contamination or by incomplete determination of transcript 
structures. 

Independent validation of multi-exonic transcript models and the
associated predicted coding products was carried out using overlapping
targeted 454 Life Sciences (Roche) paired-end reads and mass
spectrometry. Of approximately 3,000 intergenic and antisense transcript
models tested, validation rates from 70% to 90% were observed, depend- 
ing on the number of reads and IDR score. In addition, these experi- 
ments led to the identification of more than 22,000 novel splice sites not 
previously detected, meaning an almost eightfold increase in detection 
compared to the sites originally detected with RNA-seq (Supplementary 
Fig. 6). Using mass spectrometric analyses, we investigated what fraction 
of the novel Cufflinks transcript models show evidence consistent with 
protein expression. We produced 998,570 spectra from two cell lines 
(K562 and GM12878; J. Khatun et al., manuscript in preparation), and 
mapped them to a three-frame translation of the novel Cufflinks models 
(Supplementary Material). At a 1% false discovery rate (FDR), we iden- 
tified 419 novel models with 5 or more spectral and/or 2 or more peptide 
hits, of which only 56 were intergenic or antisense to GENCODE genes 
(Supplementary Table 4 and Supplementary Fig. 7). Thus, most novel 
transcripts seem to lack protein-coding capacity. 


The transcriptome of nuclear subcompartments 
For the K562 cell line, we also analysed RNA isolated from three 
subnuclear compartments (chromatin, nucleolus and nucleoplasm; 
Supplementary 5). Almost half (18,330) of the GENCODE (v7) annotated
genes detected for all 15 cell lines (35,494) were identified in the
analysis of just these three nuclear subcompartments. In addition, 
there were as many novel unannotated genes found in K562 subcom- 
partments as there were in all other data sets combined (Supplemen- 
tary Table 5 and Table 1b). For all annotated (Supplementary 
Table 5.1) or novel (Supplementary Table 5.2) elements, only a small 
fraction in each subcompartment was unique to that compartment 
(Supplementary Table 6). 

The interrogation of different subcellular RNA fractions provides 
snapshots of the status of the RNA population along the RNA proces- 
sing pathway. Thus, by analysing short and long RNAs in the different 
subcellular compartments, we confirm that splicing predominantly 
occurs during transcription. By using RNA-seq to measure the degree 
of completion of splicing (Fig. 2a), we observed that around most 
exons, introns are already being spliced in chromatin-associated 
RNA—the fraction that includes RNAs in the process of being 
transcribed (Fig. 2b). Concomitantly, we found strong enrichment 
specifically of spliceosomal small nuclear RNAs (snRNAs) in this 
RNA fraction (see ‘Short RNA expression landscape’ later).
Co-transcriptional splicing provides an explanation for the increasing
evidence connecting chromatin structure to splicing regulation, and 
we have observed that exons in the process of being spliced are 
enriched in a number of chromatin marks.
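
The splicing-completion measure used here is defined in the Fig. 2 legend as the ratio of (0.5(a + b) + c) over (0.5(a + b) + c + 0.5(d + e)). A minimal sketch of that ratio follows; the function name and the choice to return `None` when no informative reads are present are illustrative assumptions, not part of the published pipeline.

```python
def cosi(a, b, c, d, e):
    """Splicing index (coSI) for one exon, following the Fig. 2 legend.

    a, b: reads supporting completed splicing with exon inclusion
    c:    reads supporting completed splicing with exon exclusion
    d, e: reads showing that splicing around the exon is not yet completed
    Returns a value in [0, 1], or None if there are no informative reads.
    """
    spliced = 0.5 * (a + b) + c
    unspliced = 0.5 * (d + e)
    total = spliced + unspliced
    if total == 0:
        return None  # assumption: exons without evidence are left unscored
    return spliced / total
```

A value near 1 in chromatin-associated RNA is what would be expected for an exon that is spliced co-transcriptionally.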


Gene expression across cell lines 

The analyses of RNAs isolated from different subcellular compart- 
ments also provide information concerning compartment-specific 
relative steady-state abundance and the post-transcriptional
processing state (spliced/unspliced, polyadenylated/non-polyadenylated,
5’ capped/uncapped) for each of the detected transcripts. The
observed range of gene expression spans six orders of magnitude
for polyadenylated RNAs (from 10^-2 to 10^4 reads per kilobase per
million reads (r.p.k.m.)), and five orders of magnitude (from 10^-2 to
10^3 r.p.k.m.) for non-polyadenylated RNAs (Fig. 3 and Supplementary
Fig. 8a). The distribution of gene expression is very similar across
cell lines, with protein-coding genes, as a class, having on average 
higher expression levels than long non-coding RNAs (lncRNAs).
Assuming that 1-4 r.p.k.m. approximates to 1 copy per cell, we find
that almost one-quarter of expressed protein-coding genes and 80% of
the detected lncRNAs are present in our samples in 1 or fewer copies
per cell. The generally lower level of gene expression measured in
lncRNAs may not necessarily be the result of consistent low RNA
copy number in all cells within the population interrogated, but
may also result from restricted expression in only a subpopulation
of cells. In some cell lines, individual lncRNAs can exhibit steady-state
expression levels as high as those of protein-coding genes. This is, for
example, seen in the expression of the protein-coding gene actin,
gamma 1 (ACTG1), and the non-coding gene, H19 (Fig. 3). ACTG1
transcripts are part of all non-muscle cytoskeleton systems within
cells and show a steady-state expression level at the population level
that is at least 1-2 logs greater than H19, a cytosolic non-coding RNA
(ncRNA). However, when measured at the individual transcript level,
expression of lncRNA transcripts is comparable to that of individual
protein-coding transcripts (Supplementary Fig. 8b). 
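
For readers unfamiliar with the unit, r.p.k.m. (reads per kilobase per million reads) can be computed as in the sketch below. This is the standard textbook normalization, shown only for orientation; the function and parameter names are assumptions, and the quantification software actually used is listed in Supplementary Table 2.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    read_count:         reads assigned to the gene or transcript
    feature_length_bp:  length of the gene or transcript in base pairs
    total_mapped_reads: total number of mapped reads in the sample
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)
```

For example, 500 reads on a 2,000-bp transcript in a library of 25 million mapped reads correspond to 10 r.p.k.m.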

Novel antisense and intergenic genes predicted in this study
comprise a third clustering of RNAs with levels of expression ranging from
10 * to 10 ' r.p.k.m. As a class, only protein-coding genes seem to be
enriched in the cytosol, making the nucleus a centre for the accumula- 
tion of ncRNAs (Fig. 3). Other gene classes, such as pseudogenes and 
small annotated ncRNAs, also show subcellular compartmental 
enrichment (Supplementary Fig. 9). 

Higher variability and lower pairwise correlation of expression 
across all cell lines is consistent with lncRNAs contributing more to
cell-line specificity than protein-coding genes. Indeed, a considerable
fraction (29%) of all expressed lncRNAs are detected in only one of the
Figure 2 | Co-transcriptional splicing. a, Short read mappings for exon-based
splicing completion. Read mappings that allow assessment of splicing completion 
around exons are shown. Reads providing evidence of splicing completion for the 
region containing the exon (with either exon inclusion (a, b) or exclusion (c)) are 
shown. Reads providing evidence for the splicing of the region containing the 
exon not being completed yet are indicated by d and e. The complete splicing 
index (coSI) is the ratio of (0.5(a + b) + c) over (0.5(a + b) +c + 0.5(d + e)) and 
can thus be broadly assumed to correspond to the fraction of RNA molecules in 
which the region containing the exon has already been spliced (see ref. 16). A coSI 
value of 1 means splicing completed, whereas a value of 0 indicates that splicing 
has not yet been initiated. b, Distribution of coSI scores computed on GENCODE 
internal exons. Top: distribution in the total chromatin RNA fraction. Bottom: 
distribution in cytosolic polyadenylated RNA fraction. 


cell lines studied when considering the whole cell polyadenylated 
RNAs, whereas only 10% were expressed in all cell lines. Con- 
versely, whereas a large fraction (53%) of expressed protein-coding 
genes were constitutive (expressed in all cell lines), only ~7% were 
cell-line specific (Supplementary Table 7 and Supplementary Fig. 10). 
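
The cell-line-specific and constitutive fractions reported above can be illustrated with the following sketch. It assumes a simple expression threshold to call a gene detected in a cell line, whereas the study relied on reproducibility (npIDR) to define detected elements; the input layout and names are likewise hypothetical.

```python
def breadth_summary(expression, threshold=0.0):
    """Fractions of expressed genes that are cell-line specific or constitutive.

    expression: dict mapping gene -> list of expression values, one per
    cell line (hypothetical layout). A gene counts as detected in a cell
    line when its value exceeds `threshold` (an assumption of this sketch).
    Returns None when no gene is detected anywhere.
    """
    n_expressed = n_specific = n_constitutive = 0
    for values in expression.values():
        detected = sum(v > threshold for v in values)
        if detected == 0:
            continue  # never detected: not counted among expressed genes
        n_expressed += 1
        if detected == 1:
            n_specific += 1
        if detected == len(values):
            n_constitutive += 1
    if n_expressed == 0:
        return None
    return {
        "specific": n_specific / n_expressed,
        "constitutive": n_constitutive / n_expressed,
    }
```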


Patterns of splicing 

The analysis of the expression of alternative isoforms resulted in 
several observations. First, isoform expression does not seem to follow 
a minimalistic strategy. Genes tend to express many isoforms simul- 
taneously, and as the number of annotated isoforms per gene grows, 
so does the number of expressed isoforms (Fig. 4a). The increase, 
however, is not linear and seems to plateau at about 10-12 expressed 
Figure 3 | Abundance of gene types in cellular compartments.
Two-dimensional kernel density plots of nuclear over cytosolic enrichment (y axis)
versus overall gene expression in the whole cell extract (x axis), for protein 
coding, long non-coding and novel genes over all cell lines. Only genes present 


isoforms per gene. However, we cannot obviously distinguish whether 
this is the result of multiple isoforms expressed in the same cell or of 
different isoforms expressed in different cells within the interrogated 
population. Second, alternative isoforms within a gene are not 
expressed at similar levels, and one isoform dominates in a given 
condition—usually capturing a large fraction of the total gene 
expression (at least 30%, even for genes with many isoforms; 
Fig. 4b). Third, about three-quarters of protein-coding genes have 
at least two different dominant/major isoforms depending on the cell 
line (Supplementary Fig. 11a). Fourth, the number of major isoforms 
per gene grows with the number of annotated isoforms; indeed, the 
proportion of genes with n isoforms that express only one major 
isoform is strikingly proportional to 1/n (Supplementary Fig. 11b). 
Fifth, variability of gene expression contributes more than variability 
of splicing ratios to the variability of transcript abundances across cell 
lines (Supplementary Information). 
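
The notion of a dominant (major) isoform used above can be made concrete with a short sketch: for one gene in one cell line, take the isoform with the highest expression and report its share of the total gene expression. The input layout and function name are assumptions for illustration.

```python
def major_isoform(isoform_expression):
    """Dominant isoform of a gene in one cell line and its expression share.

    isoform_expression: dict mapping isoform id -> expression value
    (hypothetical layout). Returns (isoform_id, share), where share is the
    fraction of total gene expression, or None if the gene is not expressed.
    """
    total = sum(isoform_expression.values())
    if total <= 0:
        return None
    name = max(isoform_expression, key=isoform_expression.get)
    return name, isoform_expression[name] / total
```

Comparing the returned isoform identifier across cell lines would show whether a gene switches its major isoform, as described for about three-quarters of protein-coding genes.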


Alternative transcription initiation and termination 
On the basis of RNA-seq analysis of polyadenylated RNAs, a total of 
128,021 TSSs were detected across all cell lines, of which 97,778 were 
in all three RNA extracts are displayed, as well as two representative genes
(ACTG1 in red and H19 in blue), for which the expression in each individual
cell line is shown. The actual values of the estimated kernel density are indicated 
by contour lines and colour shades. 


previously annotated and 30,243 were novel intergenic/antisense 
TSSs (Supplementary Table 3a). CAGE tags, filtered by a hidden 
Markov model (HMM)-based algorithm to differentiate between 5’ 
capped termini of polymerase II transcripts and recapping events
(Supplementary Information), identified a total of 82,783 non- 
redundant TSSs (Supplementary Table 8). Approximately 48% of 
the CAGE-identified TSSs are located within 500 base pairs (bp) of 
an annotated RNA-seq-detected GENCODE TSS, whereas an addi- 
tional 3% are within 500 bp of a novel TSS (Supplementary Fig. 12). 
Notably, only ~72% of all CAGE sequencing reads map to TSSs, 
indicating that the remaining ~28% may originate from recapping
events or from a new class of TSS. 
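
The proximity comparison described above (CAGE-identified TSSs within 500 bp of an RNA-seq-detected TSS) can be sketched as follows. The sketch assumes positions on a single chromosome and strand, treats "within 500 bp" as an inclusive bound, and uses hypothetical names; it is not the comparison code used in the study.

```python
from bisect import bisect_left

def fraction_near(query_positions, reference_positions, max_distance=500):
    """Fraction of query TSSs lying within max_distance bp of a reference TSS.

    Both inputs are integer coordinates on the same chromosome and strand
    (an assumption of this sketch). Returns 0.0 for an empty query list.
    """
    ref = sorted(reference_positions)
    if not query_positions:
        return 0.0
    near = 0
    for pos in query_positions:
        i = bisect_left(ref, pos)
        # Only the closest reference on each side can be the nearest one.
        candidates = ref[max(i - 1, 0):i + 1]
        if any(abs(pos - r) <= max_distance for r in candidates):
            near += 1
    return near / len(query_positions)
```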

Using data collected within the ENCODE consortium, we carried
out a comparison of the GENCODE/RNA-seq and CAGE-determined
TSSs and correlated them to chromatin and DNA features characteristic
of initiation of transcription, such as DNase hypersensitivity,
chromatin modification and DNA binding elements. All GENCODE/
RNA-seq-determined TSSs were examined in each of the cell lines 
(Supplementary Fig. 13, column 1). Of these redundant positions, 
44.7% (199,146) of the RNA-seq-supported TSSs also displayed 


Figure 4 | Isoform expression within a gene. 

a, Number of expressed isoforms per gene per cell 
line. Genes tend to express many isoforms 
simultaneously. b, Relative expression of the most 
abundant isoform per gene per cell line. There is 
generally one dominant isoform in a given 
condition. The whiskers are defined as
Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, where IQR is the
interquartile range, and Q1 and Q3 the first and 
third quartile, respectively. Each box plot was 
constructed using the number of genes with 1, 2, 3, 
4, etc. up to 25 isoforms. 
evidence of CAGE. Approximately half of these TSS positions are
associated with at least one of the other characteristic features of transcription
initiation (DNase I, H3K27ac and H3K4me3 chromatin modifications). 
Thus, only a small minority of the TSSs identified by either CAGE or 
RNA-seq/GENCODE displayed all of the characteristics of the start of 
transcription (presence of DNase I, H3K4me3, H3K27ac sites and either 
TAF1 or TBP binding). This is consistent with the possibility that
regulatory regions proximal to TSSs are of more than one type.

At the 3’ end, a total of 128,824 sites mapping within annotated 
GENCODE transcripts were identified as potential sites of polyade- 
nylation after trimming unmapped RNA-seq reads with long terminal 
polyadenine stretches. About 20% of these mapped proximal to
annotated polyadenylation sites (PAS) whereas the remaining 80% 
correspond to novel PAS of annotated genes, raising the average 
number of PAS per gene from 1.1 to 2.5. Generally, we observed a 
cell-type preference for proximal PAS (closest to the annotated stop 
codon) in the cytosol compared to the nucleus (Supplementary 
Information). 


Short RNA expression landscape 

Annotated small RNAs 

Currently, a total of 7,053 small RNAs are annotated by GENCODE, 
85% of which correspond to four major classes: small nuclear 
(sn)RNAs, small nucleolar (sno)RNAs, micro (mi)RNAs and transfer 
(t)RNAs (Table 2a). Overall we find 28% of all annotated small RNAs 
to be expressed in at least one cell line (Table 2a). The distribution of 
annotated small RNAs differs markedly between cytosolic and 
nuclear compartments (Supplementary Fig. 14a). We found that the 
small RNA classes were enriched in those compartments where they 
are known to perform their functions: miRNAs and tRNAs in the 
cytosol, and snoRNAs in the nucleus. Interestingly, snRNAs were 
equally abundant in both the nucleus and the cytosol. When specif- 
ically interrogating the subnuclear compartments of the K562 cell 
line, however, snRNAs seem to be present in very high abundance 
in the chromatin-associated RNA fraction (Supplementary Fig. 14b, c). 
This striking enrichment is consistent with splicing being
predominantly co-transcriptional.


Unannotated short RNAs 
We detected two types of unannotated short RNAs. The first type 
corresponds to subfragments of annotated small RNAs. Because we 
performed 36-nucleotide end-sequencing of the small RNA fraction, 
we expected RNA-seq reads to map to the 5’ end of the small RNAs. 
Supplementary Figure 15 shows the mapping profile of reads along 
small RNA genes. In both the nuclear and cytosolic compartments, we 
indeed detected accumulation of reads at the start of snoRNAs and at 
the guide and passenger sequences of annotated miRNAs. For 
snRNAs, however, we observed three prominent peaks: the expected 
one at the 5’ end and two smaller ones at the middle and at the 3’ end of 
the gene, indicating fragmentation of some snRNAs. Finally, tRNAs 
seem not to have any prominent sets of 5’ end fragments present at 
levels greater than what is seen at the annotated 5’ termini. Whereas 
subfragments of mature tRNAs have been reported previously, these 
reports were confined to distinct alleles of only a few tRNA genes”**”*. 
The second and largest source of unannotated short RNAs corre- 
sponds to novel short RNAs (Table 2b) that map outside of annotated 
ones. Almost 90% of these are only observed in one cell line and are 
present at low copy numbers. Nearly 40% of these unannotated 
short RNAs are associated with promoter and terminator regions of 
annotated genes (promoter-associated short RNAs (PASRs) and 
termini-associated short RNAs (TASRs)), and their position relative 
to TSSs and transcription termination sites is similar to previous results*. 


Genealogy of short RNAs 
Genome wide, 27% of annotated small RNAs reside within 8% of
protein-coding and 5% within 3% of lncRNA genes (Supplementary
Fig. 16). Overall, about 6% of all annotated long transcripts overlap with
small RNAs and are probably precursors to these small RNAs. Although
most of these small RNAs reside in introns, when controlling for relative
exon/intron length, we found that exons from lncRNAs are comparatively
enriched as hosts for snoRNAs (Supplementary Fig. 17a).
Additionally, 8.4% of GENCODE annotated small RNAs map within 
novel intergenic transcripts, with most overlapping annotated tRNAs. 
The enrichment for tRNAs was mostly in novel intergenic transcripts 
derived from non-polyadenylated RNAs (Supplementary Fig. 17b). 
Many long RNAs, both novel and annotated, thus seem to have dual 
roles, as functional (protein coding) RNAs, and as precursors for many 
important classes of small RNAs. Using RNA-seq data from the K562 
cell line, we investigated the preferential cellular localization of these 
RNA precursors (Supplementary Fig. 18). For mature miRNAs and 
tRNAs (cytosolic enrichment), the potential RNA precursors, iden- 
tified as RNA-seq contigs overlapping the small RNAs, were detected 
to be predominantly nuclear (Supplementary Fig. 18a, d). Notably, 
whereas mature snRNAs were both nuclear and cytosolic, the overlap- 
ping long RNAs were observed to be primarily nuclear (Supplementary 
Fig. 18c). Finally, for snoRNAs (nuclear enrichment), potential long 
RNA precursors were decidedly observed to be both nuclear and 
cytosolic (Supplementary Fig. 18b). Unannotated short RNAs were 
found overall not to be enriched in either the nuclear or cytosolic 
compartment (Supplementary Fig. 18e). 


RNA editing and allele-specific expression 

The sequence of transcripts can differ from the underlying genomic 
sequence as the result of post-transcriptional editing. We developed a 
pipeline to filter sequencing artefacts and identify genes that are RNA 
edited”. Focusing first on GM12878, a cell line that has been deeply 
re-sequenced, we find a total of 51,557 RNA consistent single nucleotide
variants (SNVs) within genic boundaries, 65% of which are present in 
dbSNP. Of the remainder, 1,186 SNVs in 430 genes (Supplementary 
Fig. 19a) survive our most stringent filters and 88% of these are 
candidate adenosine to inosine A>G(I) changes. Notably, the next 
highest frequency of SNVs is for T>C (5%) and these occur primarily 
in regions with detectable antisense transcription”. We find similar 
A>G(I) frequencies of 75-84% in seven additional cell lines 
(Supplementary Fig. 19b). The remaining non-canonical edits amount 
to very few events in each cell line and are relatively evenly distributed 
(G>A is the third highest). These results do not support a recent report 
of a substantial number of non-canonical SNV edits in the RNA of 
human lymphoblastoid cells”. 

Using the AlleleSeq pipeline*’ on the SNPs in the GM12878 genome, 
we found that approximately 18% of both GENCODE annotated 
protein-coding and long non-coding genes exhibit allele-specific 
expression. The proportion of genes with allele-specific expression 
was similar in the three investigated RNA fractions (whole-cell, 
cytoplasm and nucleus; Supplementary Table 9 and Supplementary 
Information). 


Repeat region transcription 


About 18% (14,828) of CAGE-defined TSS regions overlap repetitive 
elements. More precisely, we find 322, 315, 507 and 1,262 intergenic 
CAGE clusters overlapping long interspersed element (LINE), short 
interspersed element (SINE), long terminal repeat (LTR) and other 
repeat elements, respectively (see Supplementary Information). 
Measuring Shannon entropy across cell lines, we found that CAGE 
clusters mapping to repeat regions were noticeably more narrowly 
expressed than CAGE clusters mapping within genic regions 
(Supplementary Fig. 20a). We represented the correlation of levels 
of expression compared to cell types as heat maps drawn separately 
for each of the three repeat element families (LINE, SINE and LTR) 
(Supplementary Fig. 20b-d). Although a large proportion of the transcripts
in the human genome is thought to be initiated from repetitive
elements (especially retrotransposon elements), these data clearly
point to cell-line specificity as the main characteristic of transcripts
emanating from repeat regions.
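
As a rough illustration of the entropy measure mentioned above, the sketch below computes the Shannon entropy of one CAGE cluster's expression profile across cell lines. The tag counts are hypothetical, and the normalization (counts divided by their sum, entropy in bits) is an assumption; the exact procedure is given only in the Supplementary Information.

```python
import math

def shannon_entropy(expression):
    """Shannon entropy (bits) of an expression profile across cell lines."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("profile has no signal")
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total          # share of the signal in this cell line
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical CAGE tag counts for one cluster in four cell lines.
narrow = [120, 0, 0, 0]    # expressed in a single cell line
broad = [30, 30, 30, 30]   # evenly expressed in all four

print(shannon_entropy(narrow))  # 0.0
print(shannon_entropy(broad))   # 2.0
```

Low entropy corresponds to narrow, cell-line-specific expression; the maximum for n cell lines is log2(n).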


Characterization of enhancer RNA 


It has recently been reported that RNA polymerase II binds some 
distal enhancer regions and can produce enhancer-associated tran- 
scripts named eRNA****. We used our RNA assays to detect and 
characterize transcriptional activity at enhancer loci predicted 
genome-wide from ENCODE chromatin immunoprecipitation and 
high-throughput sequencing (ChIP-seq) data”®”®. 

Figure 5a shows the aggregate pattern of RNA-seq and CAGE
signal in a strand-specific manner around the subset of predicted
gene-distal enhancers containing DNase I hypersensitive sites and
centred on those sites. In these plots, as denoted by the accumulation
of CAGE tags signifying TSSs, transcription initiation within the
enhancer region is observed, and continues outwards for several
kilobases (kb). This behaviour can be observed for the polyadenylated
and non-polyadenylated RNA fractions mapping in both intronic
and intergenic regions. As previously reported, we observe a large
diversity of expression levels at each of the transcribed enhancers.
Polyadenylated to non-polyadenylated RNA ratios, as well as nuclear
to cytoplasmic ratios, vary at individual enhancers (Supplementary Fig.
21a, b). However, contrary to some previous reports, although most
eRNAs are prevalent in the nuclear non-polyadenylated RNA fraction,
some eRNAs seemed to be polyadenylated in the nucleus. This pattern
was significantly different compared to transcripts from GENCODE
annotated and novel predicted promoters (Fig. 5b).

Transcribed enhancers on average show a significantly different
pattern of chromatin modification than non-transcribed ones.
The enhancer regions displayed stronger signals for H3K4 methylation,
H3K27 acetylation and H3K79 dimethylation along with
higher levels of RNA polymerase II binding, all associated with
transcriptional initiation and elongation (Fig. 5c). Both the transcripts
and the chromatin states are cell-type specific (Fig. 5d). Taking the
GM12878 cell line as an example, the enhancer loci producing eRNA
demonstrate enrichment of CAGE tag detection (Fig. 5d, top) and the
presence of H3K27ac histone modification (Fig. 5d, bottom) in this
cell line compared to five other analysed cell lines. This strongly
suggests that the regulatory regions governing the expression of
enhancer transcripts are distinguished from regulatory regions
located at the beginning of genic regions.

Figure 5 | Transcription at enhancers. a, The pattern of RNA elements
around enhancer predictions containing DNase I hypersensitive sites. The
lines represent the average frequency of RNA elements (top, polyadenylated
long RNA contigs; middle, CAGE tag clusters; bottom, non-polyadenylated
long RNA contigs) in a genomic window around the centre of the enhancer
prediction as determined by DNase I hypersensitive sites. Elements on the plus
strand are shown in red, and on the minus strand in blue. b, Enhancer
transcripts differ from promoter transcripts. The box plots compare the
features of transcripts at predicted enhancer loci compared to predicted novel
intergenic promoters and annotated promoters. H3K4me3, poly(A)+ and
nucleus denote the three following ratios: H3K4me3/(H3K4me3 + H3K4me1),
polyadenylated/(polyadenylated + non-polyadenylated), nuclear/(nuclear +
cytosolic). Enhancers are marked by higher levels of H3K4me1 compared to
H3K4me3 than novel or annotated promoters (left). Enhancer transcripts show
higher levels of non-polyadenylated (middle) and nuclear (right) RNA relative
to promoters. c, Chromatin state at transcribed enhancers. Enhancer
predictions with evidence of transcription (in blue; CAGE tags present at
predicted locus) show a different pattern of histone modification and higher
levels of RNA polymerase II binding than non-transcribed predictions (red).
They are enriched for H3K27 acetylation, H3K4 methylation, H3K79
dimethylation and depleted for H3K27 trimethylation. d, Enhancer activity and
transcription is cell-type specific. Loci predicted to be active transcribed
enhancers in GM12878 cells show low signal for CAGE tags (top) and for
H3K27 acetylation (bottom) in other cell lines. The whiskers are defined as
Q1 − 1.5 × IQR to Q3 + 1.5 × IQR, where IQR is the interquartile range, and Q1
and Q3 the first and third quartile, respectively.

Figure 6 | Size distribution of intergenic regions. Novel genes increase the
proportion of small intergenic regions.
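
The quantities plotted in Fig. 5b are simple fractions of the form a/(a + b), and the box-plot whiskers follow the Q1 − 1.5 × IQR to Q3 + 1.5 × IQR rule given in the legend. The sketch below shows both under stated assumptions: the signal values are hypothetical, and the "inclusive" quartile convention is our choice, since the legend does not say which convention was used.

```python
import statistics

def fraction(a, b):
    """Return a / (a + b), the form of each ratio in Fig. 5b."""
    if a + b == 0:
        raise ValueError("both signals are zero")
    return a / (a + b)

def whisker_limits(values):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) for a list of values."""
    # Quartile convention is an assumption; others give slightly different cuts.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical ChIP-seq signals at one locus.
h3k4me3, h3k4me1 = 2.0, 6.0
print(fraction(h3k4me3, h3k4me1))      # 0.25, an enhancer-like value

print(whisker_limits([1, 2, 3, 4, 5]))  # (-1.0, 7.0)
```

The same `fraction` helper applies to the poly(A)+ ratio (polyadenylated versus non-polyadenylated signal) and the nuclear ratio (nuclear versus cytosolic signal).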


Concluding remarks 

The cumulative coverage of transcribed regions in the 15 cell lines 
across the human genome is 62.1% and 74.7% for processed and 
primary transcripts, respectively (Supplementary Table 10 and 
Supplementary Fig. 22). On average, for each cell line, 39% of the 
genome is covered by primary transcripts and 22% by processed 
RNAs. No cell line showed transcription of more than 56.7% of the 
union of the expressed transcriptomes across all cell lines. When 
mapping the current RNA-seq data to the ENCODE pilot regions 
(Supplementary Table 10), we observed a similar, albeit higher, extent 
of transcriptional coverage of 73.3% for processed RNAs and 84.5% 
for primary transcripts. Previously reported estimates in these regions 
for processed and primary transcripts were 24% and 93%, respectively 
(Supplementary Table 2.4.3 and ref. 3). The increased genome 
coverage by processed RNAs stems largely from the inclusion of
non-polyadenylated RNAs in the current study. Other than that,
given the differences in the samples studied, the selection of pilot
regions with high genic content, the increase of annotated genomic
regions over time, and the different technologies used to interrogate
transcription, both estimates are in reasonable agreement.

Expression of GENCODE (v7) annotated elements (a)

Gene type | Detected exons† (annotation no.) | Detected splice junctions† (annotation no.) | Detected transcripts† (annotation no.) | Detected genes† (annotation no.) | Exon nucleotide coverage‡ (%) | Number of genes expressed in at least one cell line | Number of genes expressed in only one cell line | Proportion over genes expressed§ (%) | Number of genes expressed in 14 cell lines | Proportion over genes expressed‖ (%)
Long non-coding | 22,381 (41,467) | 8,017 (26,872) | 6,521 (14,880) | 5,906 (9,277) | 87.5 | 5,906 | 1,386 | 23.5 | 631 | 10.7
Protein coding | 288,322 (318,514) | 194,752 (244,158) | 59,822 (76,006) | 18,939 (20,679) | 98.1 | 18,939 | 1,082 | 5.7 | 10,571 | 55.8
Other* | 102,000 (133,937) | 19,277 (47,663) | 45,410 (71,113) | 10,649 (21,750) | 95.2 | 10,649 | 2,453 | 23.0 | 1,896 | 17.8
Total annotated | 412,703 (493,918) | 222,046 (318,693) | 111,753 (161,999) | 35,494 (51,706) | 96.7 | 35,394 | 4,921 | 13.9 | 13,098 | 37.0

Expression of GENCODE (v7) intergenic and antisense elements (b)

Category | Detected exons† | Detected splice junctions† | Detected transcripts† | Detected genes†
Mono-exonic | 55,683 | NA | 55,682 | 33,686
Multi-exonic | 39,117 | 69,052 | 17,643 | 7,518
Total | 94,800 | 69,052 | 73,325 | 41,204

NA, not applicable.
*Includes pseudogenes, miRNAs, etc.
†All elements that passed npIDR (0.1).
‡Cumulative detected nucleotide in detected exons/total nucleotides in detected exons.
§Proportion for genes expressed in only one cell line.
‖Proportion for genes expressed in 14 cell lines.

Table 2 | Short RNAs

Expression of GENCODE (v7) annotated small RNA genes (a)

Gene type* | GENCODE total | Detected genes (% detected) | No. genes expressed in only one cell line (% detected) | No. genes expressed in 12 cell lines (% detected) | miRNA guide fragment‡ | miRNA passenger fragment§ | Internal fragments‖ of annotated small RNA (average per detected gene)
miRNA | 1,756 | 497 (28) | 59 (12) | 147 (30) | 454 (454) | 175 (175) | 18
snoRNA | 1,521 | 458 (30) | 73 (16) | 223 (49) | NA | NA | 60
snRNA | 1,944 | 378 (19) | 123 (33) | 41 (11) | NA | NA | 36
tRNA | 624 | 465 (75) | 29 (6) | 197 (42) | NA | NA | 52
Other† | 1,209 | 191 (16) | 69 (36) | 24 (13) | NA | NA | 32
Total GENCODE | 7,054 | 1,989 (28) | 353 (18) | 632 (32) | NA | NA | 40

Expression of unannotated short RNAs (b)

Cell compartment | Unannotated short RNAs | Exonic | Intronic | Exon-intron boundaries | Genic | Gene-intergene boundaries | Intergenic
Cell | 57,393 | 14,116 | 13,773 | 1,818 | 29,707 | 13,048 | 25,906
Nucleus | 82,297 | 19,334 | 40,136 | 5,248 | 64,718 | 7,417 | 16,289
Cytosol | 25,455 | 6,183 | 5,605 | 665 | 12,453 | 6,631 | 12,447
Three compartments | 150,165 | 38,969 | 55,061 | 7,552 | 101,582 | 23,185 | 45,081

NA, not applicable.
*Includes all other GENCODE small transcript biotypes except for pseudogenes.
†All elements that have passed npIDR (0.1).
‡Number of detected miRNAs with an expressed annotated guide (with an annotated guide in mirbase).
§Number of detected miRNAs with an expressed annotated passenger (with an annotated passenger in mirbase).
‖Short RNA-seq mapping for which the 5’ end starts 5 bp after the start and ends 5 bp before the end
of a detected gene.

As a consequence of both the expansion of genic regions by the 
discovery of new isoforms and the identification of novel intergenic 
transcripts, there has been a marked increase in the number of intergenic
regions (from 32,481 to 60,250) due to their fragmentation and a
decrease in their lengths (from 14,170 bp to 3,949 bp median length; 
Fig. 6). Concordantly, we observed an increased overlap of genic 
regions. As the determination of genic regions is currently defined 
by the cumulative lengths of the isoforms and their genetic association 
to phenotypic characteristics, the likely continued reduction in the 
lengths of intergenic regions will steadily lead to the overlap of most 
genes previously assumed to be distinct genetic loci. This supports 
and is consistent with earlier observations of a highly interleaved 
transcribed genome’’, but more importantly, prompts the reconsid- 
eration of the definition of a gene. As this is a consistent characteristic 
of annotated genomes, we would propose that the transcript be con- 
sidered as the basic atomic unit of inheritance. Concomitantly, the 
term gene would then denote a higher-order concept intended to 
capture all those transcripts (eventually divorced from their genomic 
locations) that contribute to a given phenotypic trait. Co-published 
ENCODE-related papers can be explored online via the Nature 
ENCODE explorer (http://www.nature.com/ENCODE), a specially 
designed visualization tool that allows users to access the linked 
papers and investigate topics that are discussed in multiple papers 
via thematically organized threads. 
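
The fragmentation of intergenic regions described in these remarks can be illustrated with a toy calculation: adding a novel transcript inside a gap between annotated genes increases the number of intergenic regions and lowers their median length. The coordinates below are hypothetical, and the half-open interval convention is an assumption made for the sketch.

```python
import statistics

def intergenic_regions(genes, chrom_length):
    """Return (start, end) gaps not covered by any gene interval.

    genes: list of (start, end) half-open intervals on one chromosome.
    """
    gaps = []
    cursor = 0
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)   # overlapping genes extend the covered span
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps

def median_length(regions):
    return statistics.median(end - start for start, end in regions)

annotated = [(100, 400), (1000, 1500)]   # hypothetical annotated genes
with_novel = annotated + [(600, 700)]    # plus one hypothetical novel transcript

before = intergenic_regions(annotated, 2000)
after = intergenic_regions(with_novel, 2000)
print(len(before), median_length(before))  # 3 500
print(len(after), median_length(after))    # 4 250.0
```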


METHODS SUMMARY 
For full details of Methods, see Supplementary Information. 


The long-range interaction landscape of gene promoters

The vast non-coding portion of the human genome is full of
functional elements and disease-causing regulatory variants. The 
principles defining the relationships between these elements and 
distal target genes remain unknown. Promoters and distal ele- 
ments can engage in looping interactions that have been implicated 
in gene regulation’. Here we have applied chromosome conforma- 
tion capture carbon copy (5C’) to interrogate comprehensively 
interactions between transcription start sites (TSSs) and distal ele- 
ments in 1% of the human genome representing the ENCODE pilot 
project regions’. 5C maps were generated for GM12878, K562 and 
HeLa-S3 cells and results were integrated with data from the 
ENCODE consortium’. In each cell line we discovered >1,000 
long-range interactions between promoters and distal sites that 
include elements resembling enhancers, promoters and CTCF- 
bound sites. We observed significant correlations between gene 
expression, promoter-enhancer interactions and the presence of 
enhancer RNAs. Long-range interactions show marked asymmetry 
with a bias for interactions with elements located ~120 kilobases 
upstream of the TSS. Long-range interactions are often not blocked 
by sites bound by CTCF and cohesin, indicating that many of 
these sites do not demarcate physically insulated gene domains. 
Furthermore, only ~7% of looping interactions are with the 
nearest gene, indicating that genomic proximity is not a simple 
predictor for long-range interactions. Finally, promoters and 
distal elements are engaged in multiple long-range interactions 
to form complex networks. Our results start to place genes and 
regulatory elements in three-dimensional context, revealing their 
functional relationships. 

Spatial proximity and specific long-range interactions between 
genomic elements can be detected using chromosome conformation 
capture (3C)-based methods’. Previous studies have been limited 
to analysis of single loci**, interactions that involve a single 
protein of interest’, or to analysis of genome-wide folding of 
chromosomes at a resolution that cannot detect specific looping 
interactions between genes and functional elements’®. To overcome 
these limitations we previously developed 5C (ref. 2). 5C is a high- 
throughput adaptation of 3C and uses pools of reverse and forward 5C 
primers to detect long-range interactions between two targeted sets of 
genomic loci, for example, promoters and distal gene regulatory 
elements in this study. By targeting a specific part of the genome, 
5C facilitates detection of interactions at single restriction fragment 
resolution. 

To begin to define the principles of long-range gene regulation in 
the human genome we have used 5C to map interactions systematically 
between promoters and distal elements throughout the 44 ENCODE 
pilot project regions representing 1% (30 megabases (Mb), Supplementary
Table 1) of the genome in three cell lines (Fig. 1a). The
ENCODE regions, ranging in size from 500 kilobases (kb) to 1.9 Mb, 
were selected for comprehensive annotation by the ENCODE pilot 
project. Here we analysed interactions between 628 TSS-containing
restriction fragments and 4,535 ‘distal’ restriction fragments covering
the ENCODE regions (Fig. 1a and Supplementary Tables 2 and 3; see
also Methods).

5C libraries were generated for two biological replicates of 
GM12878, K562 and HeLa-S3 (Supplementary Tables 4-6). These cell 
lines are extensively annotated by the ENCODE consortium**. 5C 
interaction frequencies measured between ENCODE regions located 
on different chromosomes were used to quantify minor variations in 
interaction detection efficiencies due to technical biases related to 5C 
primer efficiency, restriction fragment length, or digestion efficiency. 
5C interaction frequencies were then corrected for these biases 
(Methods and Supplementary Data). 

An example of a 5C long-range interaction map representing 
TSS-distal fragment interactions along and between 14 ENCODE 
regions (ENm001-ENm014) is shown in Fig. 1b. 5C detects known 
general features of spatial chromatin organization. First, interactions 
within the same ENCODE region are more frequent than those 
between different ENCODE regions. Within one ENCODE region 
interaction frequencies are generally higher for pairs of loci located 
closer together in the linear genome. This inverse relationship between 
genomic distance and interaction frequency is as expected for a flexible 
chromatin fibre*'*. Second, interactions between ENCODE regions 
that are located on the same chromosome are more frequent than 
interactions between regions located on different chromosomes 
(arrow in Fig. 1b). This is consistent with 4C and Hi-C analyses", 
and is due to the formation of spatially separated chromosome 
territories. 

5C data sets were analysed to identify TSS—distal fragment pairs that 
interact more frequently than expected, indicating that they are rela- 
tively close in space. For each biological replicate we independently 
determined the average relationship between interaction frequency 
and genomic distance (solid red lines in Fig. 1c, d). We defined this 
as the expected interaction frequency. Next we identified interactions 
that occur significantly more frequently than expected for loci sepa- 
rated by a corresponding genomic distance by transforming 5C signals 
into a z-score (false discovery rate (FDR) = 1%; Methods). Specific 
long-range interactions are then defined as pairs of loci that interact 
significantly more frequently than expected in both replicates. By 
excluding interactions that are significant in only one replicate, we 
estimate that only around 10-18% of the significant long-range inter- 
actions identified by our approach might be false positives, as esti- 
mated from analysis of interactions in gene desert ENCODE regions 
(ENr112, ENr113 and ENr313) where no significant long-range inter- 
actions were expected (Methods). This application of stringent thresh- 
olds probably leads to a higher false-negative rate. Consistently, 
interaction frequencies that are found to be significant in only
one replicate are still significantly elevated in the other replicate as
compared to interactions that are never significant, but are just below
the chosen 1% FDR threshold (Supplementary Fig. 1).

Figure 1 | 5C approach to identify looping interactions. a, 5C design.
Reverse 5C primers were designed for HindIII fragments that contain a TSS
(red; according to GENCODE v7) and forward 5C primers for all other
‘distal’ HindIII fragments (blue). b, Heat map of all interrogated TSS-distal
fragment interactions in 14 ENCODE regions (ENm001-ENm014) in K562
cells. Fragments are displayed in their genomic order. Each dark rectangular
area in the heat map denotes interactions within a single ENCODE region
whereas remaining areas denote interactions between regions. ENCODE
regions that are on the same chromosome show a higher interaction frequency
(arrow) than regions that were on different chromosomes. c, d, Examples of 5C
interaction profiles for two TSSs indicated by vertical orange bars (left, ACSL6
gene located in ENm002; right, γ-δ-globin located in ENm009). The solid red
lines show the expected interaction level (Lowess line, Methods); dashed red
lines above and below indicate Lowess ± 1 standard deviation. 5C signals that
are significantly higher than expected in both biological replicates (green
circles, FDR = 1%) are considered looping interactions. Interactions that are
significant in only one replicate (blue circles) are not considered as a
high-confidence 5C looping interaction. 5C peak calling detects a long-range
interaction between the TSS of ACSL6 and a distal CTCF-bound element in
GM12878 cells. The approach identifies the known long-range interactions of
γ-δ-globin to HS3, HS4, HS5 and HS-111 and several additional DHS and
CTCF sites in K562 cells (labelled).

Our analysis correctly identified known interactions between TSSs
and their cognate distal regulatory elements, providing validation of
the approach (Supplementary Fig. 3). As an example, Fig. 1d shows the
5C interaction profile in K562 cells for a TSS located in the β-globin
locus. We previously found that this TSS located just downstream of
the γ-globin genes displayed prominent looping interactions with
the distal locus control region (LCR) in K562 cells. Our analysis
accurately detected these looping interactions (HS3, HS4 and HS5).
We identified additional known long-range interactions with DNase I
hypersensitive sites (DHSs) near distal CTCF-bound elements (3'HS1
and HS-111). In K562 cells we also detected the known interactions
between the γ-globin gene (HBG1) and the LCR (HS5) and
between the α-globin genes and three distal regulatory elements
including the α-globin enhancer HS40, and two CTCF-bound
elements (HS46 and HS10), located 40, 46 and 10 kb upstream of
the genes, respectively (Supplementary Fig. 3 and refs 15, 16). The
importance of these distal elements in regulating globin gene
expression through looping has been extensively documented. As
expected, these looping interactions in the globin loci were not
detected in GM12878 or HeLa-S3 cells that express little or no globin
(Supplementary Fig. 3). Additional examples of cell-type-specific
TSS-distal element interactions are shown in Supplementary Fig. 4.
Furthermore, 5C interaction frequencies are correlated with TSS-distal
DHS pairs predicted to be functionally connected based on their
highly correlated activity across a large panel of cell lines (P < 10^-10,
one-sided Mann-Whitney U-test), providing independent validation
of their biological significance.

In each cell line we identified large numbers of statistically significant
TSS-distal fragment interactions, of which ~60% were observed
in only one of the three cell lines (Fig. 2a). These data point to intricate
cell-type-specific three-dimensional folding of chromatin. 3C-based
assays detect specific and functional interactions, for example, TSSs
with gene regulatory elements. In addition, the assay will detect
‘structural’ interactions, for example, close spatial proximity as a result of
other nearby specific looping interactions (bystander interactions) or
overall higher order folding of the chromatin fibre. To determine 
which looping interactions involved distal sites that displayed specific 
chromatin features associated with functional elements, we compared 
our data with data sets generated by the ENCODE consortium (Fig. 2b 
and Supplementary Table 7). We found that looping interactions in all 
cell lines were significantly enriched for distal fragments that are 
bound by CTCF—a protein known to mediate DNA looping’*— 
contain open chromatin (as determined by FAIRE or DHS
mapping), and/or contain histones with modifications that are characteristic
for active functional elements (H3K4me1, H3K4me2 and
H3K4me3). Long-range interactions are also enriched for H3K9ac 
and H3K27ac, but are not enriched or significantly depleted for 
H3K27me3, a mark typically found at inactive or closed chromatin. 
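
The peak-calling scheme used to define looping interactions (an expected frequency for each genomic distance, a z-score per fragment pair, and a requirement for significance in both replicates) can be illustrated with a rough sketch. Several simplifications are assumptions of the sketch and not the published method: a per-bin mean stands in for the Lowess fit, a fixed z-score cutoff stands in for the 1% FDR threshold, and all pair names, distances and frequencies are hypothetical.

```python
import statistics
from collections import defaultdict

def zscores_by_distance(interactions, bin_size=50_000):
    """Map each pair id to a z-score against pairs at similar distance.

    interactions: dict of pair_id -> (genomic_distance_bp, frequency).
    """
    bins = defaultdict(list)
    for distance, freq in interactions.values():
        bins[distance // bin_size].append(freq)
    scores = {}
    for pair_id, (distance, freq) in interactions.items():
        values = bins[distance // bin_size]
        expected = statistics.mean(values)   # stand-in for the Lowess fit
        spread = statistics.pstdev(values)
        scores[pair_id] = (freq - expected) / spread if spread > 0 else 0.0
    return scores

def looping_interactions(rep1, rep2, z_cutoff=2.0):
    """Pairs whose z-score exceeds the cutoff in both replicates."""
    z1 = zscores_by_distance(rep1)
    z2 = zscores_by_distance(rep2)
    return sorted(p for p in z1
                  if p in z2 and z1[p] > z_cutoff and z2[p] > z_cutoff)

def toy_replicate(elevated):
    """Hypothetical replicate: background frequency 1.0, elevated pairs 10.0."""
    data = {}
    for i in range(1, 7):
        data[f"p{i}"] = (10_000 + i * 5_000, 1.0)   # distances in bin 0
        data[f"q{i}"] = (60_000 + i * 5_000, 1.0)   # distances in bin 1
    for pair_id in elevated:
        distance, _ = data[pair_id]
        data[pair_id] = (distance, 10.0)
    return data

rep1 = toy_replicate(["p6", "q6"])
rep2 = toy_replicate(["p6"])
print(looping_interactions(rep1, rep2))  # ['p6']
```

Pair `q6` is elevated in only one replicate and is therefore excluded, mirroring the rule that interactions significant in a single replicate are not treated as high-confidence looping interactions.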

To gain more insight into the types of element present in the distal 
looping fragments, we made use of genome-wide and cell-line-specific 
segmentation analyses that identified seven distinct chromatin states 
based on histone modifications, the presence of DHSs and the 
localization of proteins such as RNA polymerase II and CTCF (ref. 4 
and Fig. 2b). These states are: (1) enhancer (E); (2) weak enhancer 
(WE); (3) TSS; (4) predicted promoter flanking regions (PF); (5) 
insulator element (CTCF); (6) predicted repressed region (R); and 
(7) predicted transcribed region (T). The ENCODE consortium tested 
sets of the E elements in enhancer assays and confirmed that >50% 
display enhancer activity*. We found that looping interactions were 
significantly enriched for distal fragments that contained E, WE and 
CTCF elements, and the actively transcribed chromatin state (T), but 
were depleted for the repressed chromatin state (R). We note that some 
distal looping fragments contained elements classified as TSS or PF, 
even though they did not contain TSSs as defined by the GENCODE v7 
annotation”. Possibly, these are yet-to-be-annotated TSSs. 
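
Enrichment of a chromatin feature among looping fragments, relative to all interrogated fragments, is the kind of question a hypergeometric test answers. The sketch below computes only the upper (enrichment) tail and applies a Bonferroni correction; the analysis summarized in Fig. 2b is two-tailed, so this is a simplified illustration, and all counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when drawing N fragments from M, of which n carry the feature."""
    upper = min(n, N)
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / total

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical counts: 10 interrogated distal fragments, 4 carry the feature,
# 3 fragments loop, and all 3 looping fragments carry the feature.
p = hypergeom_upper_tail(k=3, M=10, n=4, N=3)
print(p)                  # 4/120, about 0.033
print(bonferroni(p, 5))   # about 0.167 after correcting for 5 features
```

A depletion test would use the lower tail, P(X <= k), computed in the same way over the range 0 to k.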

Next, we used the seven-way segmentation data to categorize 
looping interactions into four broader functional groups (Fig. 2c, 
Supplementary Fig. 5 and Supplementary Data): those that involve a 
distal fragment that contains a putative enhancer (‘E’ (E or WE)), a 
putative promoter (‘P’ (TSS or PF)), or a CTCF-bound element 
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unclassified fragments contain chromatin features found at active 
chromatin elements (Supplementary Fig. 7). Thus, these are not simply 
noise or false positives, but are probably the result of the conservative 
segmentation approach. 

We found that TSS-E and TSS-P interactions are more cell-type 
specific than TSS-CTCF interactions: for the TSS-E and TSS-P 
categories, the ratio of interactions that is seen in only one cell line 
versus more than one cell line is ~4:1, whereas it is close to ~ 1:1 for the 
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Figure 2 | Distribution of looping interactions across cell types and their 
relationship with chromatin features and gene expression. a, Venn diagram 
showing the number of unique and overlapping looping interactions across 
three cell types. b, Heat map showing the enrichment/depletion of chromatin 
features in looping fragments compared to all interrogated fragments based on 
genome-wide data sets from the ENCODE consortium (Supplementary Table 
7). Features include open chromatin (UW-DHS (UW, University of 
Washington), Duke-DHS and UNC-FAIRE (UNC, University of North 
Carolina; FAIRE, formaldehyde-assisted isolation of regulatory elements)); 
active marks (Broad Institute histone H3K4me1/2/3, H4K20mel, H3K27ac, 
H3K9ac); CTCF (Broad Institute CTCF ChIP peaks); inactive marks (Broad 
Institute histone H3K27me3); and seven-way segmentation* (based on HMM 
prediction for indicated cells). We further grouped segmentation categories E 
and WE into ‘E class’, TSS and PF into ‘P class’, and R and T into ‘broad marks’. 
The colour scale represents the fold enrichment (red) or depletion (blue). The 
numbers listed inside each box represent P values of the significant (P < 0.05) 
enrichment/depletion for that mark, where (for example) E—32 indicates 
x10 ** (NS, not significant, grey; two-tailed hypergeometric test and corrected 
for multiple testing using Bonferroni). c, Venn diagram showing the number of 
unique and overlapping looping distal fragments (top) and looping interactions 
(bottom) among four functional groups in GM12878 cells. Distal fragments are 
classified into four non-exclusive groups based on the seven-way segmentation. 
Similarly, TSS—distal fragment interactions are classified based on the 
functional grouping of the distal fragments. The four functional groups are E 
class (yellow), P class (magenta), CTCF (cyan) and unclassified (grey). d, Pie 
charts showing percentages and numbers of expressed/non-expressed TSSs 
looping or not looping to a particular group (E, P, CTCF or unclassified; 
coloured as in c) of distal fragments in GM12878 cells. TSSs with a CAGE value 
>0 are deemed expressed. Significant enrichment for expressed TSSs in the 
looping or non-looping categories is indicated on top (hypergeometric test; 
P_hyper < 0.05). Significant differences in expression levels between TSSs in the 
looping versus the non-looping category are indicated on the left (Wilcoxon 
signed-rank test; P_Wilcoxon < 0.05). 


(CTCF). The final class contains interactions with fragments that do 
not contain any of these three types of element, although they do 
contain T and R states (‘U’, unclassified). The last class is relatively 
large but is still significantly enriched in features that are characteristic 
for active functional elements such as H3K4me1, and over 60% of the 


TSS-CTCF category (Supplementary Fig. 5). The cell-type-specific 
activity of some of these E elements was confirmed using transient 
reporter assays (Supplementary Fig. 10). Next, we determined whether 
looping of a TSS to any of the four categories of chromatin states is 
correlated with transcription. We used CAGE expression data to 
assign an expression level to each TSS. We found that looping 
interactions with fragments containing enhancer-like E elements were 
significantly enriched for those that involved expressed TSSs (Fig. 2d 
and Supplementary Fig. 6). In addition, the subset of TSSs that interact 
with fragments containing E elements was significantly more highly 
expressed compared to TSSs that do not interact with E elements. 
Interactions with other classes of element (CTCF, P and U) are sig- 
nificantly enriched for actively expressed genes in some, but not all, cell 
lines (Supplementary Fig. 6). 

Active enhancers often express enhancer RNAs. We used a 
comprehensive enhancer RNA data set generated by the ENCODE 
consortium to determine whether TSSs preferentially interact with 
active enhancer-like elements. We found that E elements that are 
looping to TSSs are significantly more likely to express enhancer RNAs 
than E elements that are not looping (P<5 X 10°, hypergeometric 
test, Supplementary Fig. 10). We conclude that looping interactions 
preferentially involve active enhancer-like elements. 
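
The enrichment test referred to above can be illustrated with an upper-tail hypergeometric probability. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, the one-sided form is an assumption, and the counts are invented placeholders rather than values from this study.

```python
from math import comb

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when N items are drawn without replacement from a
    population of M items, n of which carry the feature of interest."""
    total = comb(M, N)
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts (illustration only, NOT data from the paper):
# 1,000 E elements, 200 of which express enhancer RNA;
# 150 E elements loop to a TSS, 60 of which express enhancer RNA.
p = hypergeom_upper_tail(k=60, M=1000, n=200, N=150)
print(f"P(X >= 60) = {p:.3g}")
```

A small p here would indicate that looping E elements express enhancer RNA more often than expected if looping and enhancer RNA expression were independent.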

Next we analysed the distribution of long-range interactions 
upstream and downstream of TSSs. To generate this landscape of 
looping interactions we aligned all TSSs and calculated the average 
number of interactions that a TSS has with each class of distal element 
at increasing genomic distances upstream and downstream of the TSS. 
Figure 3a shows the resulting average long-range interaction profile 
across all three cell lines (similar results were obtained when each of 
the cell lines was analysed separately; Supplementary Fig. 8). Notably, 
we found that the long-range interaction landscape is asymmetric, 
with interactions of E, P and CTCF classes peaking around 120 kb 
upstream of the TSS. This asymmetry of interactions reveals an 
unanticipated directionality in long-range interactions with TSSs. 
This may indicate the presence of topological constraints imposed 
by the mechanism by which such interactions regulate target 
promoters. No such bias was observed for the set of unclassified 
elements, or for the complete set of interrogated interactions 
(Fig. 3a). Interestingly, previous analyses showed that conserved 
non-coding elements are also often found within similar distances of 
target genes. Third, when we analysed expressed TSSs and non- 
expressed TSSs separately, we found that both have a similar 
interaction landscape but that expressed TSSs tend to have more inter- 
actions, especially with the E, P and CTCF classes. We cannot rule out 
the possibility that some TSSs classified as non-expressed based on the 
absence of CAGE tags are actually expressed at low levels. 
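
The averaging procedure behind such an interaction landscape can be sketched as follows. The 150-kb window and 5-kb step are those given in the Figure 3 legend; the function names, the 500-kb span, the sign flip for minus-strand TSSs and the toy coordinates are assumptions made for illustration only.

```python
def signed_distance(tss_pos, tss_strand, element_pos):
    """Distance of a distal element from a TSS; negative means upstream.
    Assumption: coordinates are flipped for TSSs on the minus strand."""
    d = element_pos - tss_pos
    return d if tss_strand == "+" else -d

def interaction_profile(distances, n_tss, span=500_000, window=150_000, step=5_000):
    """Average number of interactions per TSS in sliding windows.
    Returns a list of (window_centre, interactions_per_tss) tuples."""
    profile = []
    centre = -span
    while centre <= span:
        lo, hi = centre - window / 2, centre + window / 2
        count = sum(1 for d in distances if lo <= d < hi)
        profile.append((centre, count / n_tss))
        centre += step
    return profile

# Toy interactions: (TSS position, TSS strand, distal element position)
interactions = [
    (1_000_000, "+", 880_000),    # 120 kb upstream
    (1_000_000, "+", 1_050_000),  # 50 kb downstream
    (2_000_000, "-", 2_120_000),  # 120 kb upstream of a minus-strand TSS
]
distances = [signed_distance(*x) for x in interactions]
profile = interaction_profile(distances, n_tss=2)
peak_centre, peak_value = max(profile, key=lambda point: point[1])
print(peak_centre, peak_value)
```

Because each window is normalized by the number of TSSs, profiles built from different sets of TSSs (for example expressed versus non-expressed) can be compared directly.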

Next we explored whether the relative order of elements in the gen- 
ome affects which long-range interactions occur. It is often assumed 
that distal elements such as enhancers target the nearest TSS. Only ~7% 
of the looping interactions are between an element and the nearest TSS 
(Fig. 3b). This number goes up to 22% when only active TSSs are 
included. Similarly, 27% of the distal elements have an interaction with 
the nearest TSS, and 47% of elements have interactions with the nearest 
expressed TSS. Thus, when predicting TSS-distal element interactions, 
choosing the nearest (active) gene is often not correct. 
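
Whether an interaction targets the nearest TSS can be determined by counting the TSSs that lie between the distal element and its target. The sketch below assumes single-coordinate positions on one chromosome; the function name and toy coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_skipped(sorted_positions, a, b):
    """Number of features located strictly between coordinates a and b.
    sorted_positions must be sorted in ascending order."""
    lo, hi = min(a, b), max(a, b)
    return max(0, bisect_left(sorted_positions, hi) - bisect_right(sorted_positions, lo))

# Toy example: four TSSs and two looping interactions (element, target TSS)
tss_positions = [100, 200, 300, 400]
loops = [(50, 300), (250, 300)]

skipped = [count_skipped(tss_positions, element, tss) for element, tss in loops]
fraction_nearest = sum(1 for s in skipped if s == 0) / len(loops)
print(skipped, fraction_nearest)
```

The same counting function applies to any other class of intervening site, provided its positions are supplied as a sorted list.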

It has been suggested that CTCF sites located between an enhancer 
and a TSS may prevent enhancer-promoter interactions, although 
in individual cases interactions over such sites have been observed. 
Figure 3 | Looping landscape of TSSs to distal fragments. a, Composite 
profile of average number of group-specific looping interactions upstream and 
downstream of TSSs on the basis of combined 5C interaction data from the 
three cell lines. The top panel shows the average looping profiles of all TSSs 
(left), of expressed TSSs (middle) and of non-expressed TSSs (right). The 
bottom set of plots shows the corresponding profiles of all interrogated TSS- 
distal element interactions (left), of expressed TSSs (middle) and of non- 
expressed TSSs (right). All the interaction data for a particular group for all 
three cell lines are binned with a sliding window of 150 kb (step size of 5 kb) and 
normalized for the number of TSSs. b, Histogram showing the number of distal 
fragments that are involved in looping with their target promoters skipping 
0, 1, 2, ..., 25 (and above) TSSs. c, Histogram showing the number of looping 
interactions that skip over 0, 1, 2,..., 25 (and above) restriction fragments 
bound by either CTCF (left) or by both CTCF and RAD21 (cohesin; right). In 
b and c combined results for all three cell lines are plotted and values above 24 
on the x axis are added and grouped as 25+. Percentage of looping interactions 
that skip ≥1 CTCF (left) or CTCF plus cohesin (right) are indicated on top. 


To address this question we determined the frequency of identified 
long-range interactions between a TSS and a distal element that skip 
over one or more sites bound by CTCF. We found that 79% of long- 
range interactions are unimpeded by the presence of one or more 
CTCF-bound sites (Fig. 3c). Thus, the presence of a CTCF-bound site 
does not block physical long-range interactions. It has been reported 
that CTCF acts in conjunction with the cohesin complex to block 
promoter-enhancer interactions. We found that 58% of looping 
interactions skip sites co-bound by CTCF and cohesin (Fig. 3c). We 
obtained similar results when the different categories of long-range 
interaction (TSS-E, TSS-P, TSS-CTCF and TSS-U) were analysed 
separately. Possibly, additional factors need to be recruited to CTCF-bound 
sites to acquire interaction-blocking activity. 

The large number of long-range interactions that we discovered 
indicate that distal elements and TSSs are each engaged in multiple 
long-range interactions. To characterize this phenomenon in more 
detail we determined the interaction degree of TSSs and distal fragments. 
We found that ~50% of TSSs display one or more long-range 
interactions, with some interacting with as many as 20 distal fragments 
(Fig. 4a). Expressed TSSs interact with slightly more fragments than 
non-expressed TSSs (the mean for GM12878 is 1.88 versus 
1.37, or 3.88 versus 3.25 when including only those TSSs with at least 
one interaction). Out of all distal fragments interrogated, ~10% inter- 
acted with one or more TSS, with some interacting with more than 10 
(mean of 2.15 (for GM12878) when including only those distal frag- 
ments with at least one interaction). The degree distribution of the four 
categories of distal elements was very similar (Supplementary Fig. 9). 
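
The two mean degrees quoted above (over all interrogated TSSs or fragments, and over only those with at least one interaction) can be computed as in the following sketch. The function name and the toy interaction list are hypothetical.

```python
from collections import Counter

def degree_stats(all_nodes, interactions, side):
    """Mean degree over all nodes and over looping nodes only.
    interactions: list of (tss, distal_fragment) pairs.
    side: 0 to count degrees of TSSs, 1 for distal fragments."""
    degree = Counter(pair[side] for pair in interactions)
    degrees = [degree.get(node, 0) for node in all_nodes]
    looping = [d for d in degrees if d > 0]
    mean_all = sum(degrees) / len(degrees)
    mean_looping = sum(looping) / len(looping) if looping else 0.0
    return mean_all, mean_looping

# Toy data: four interrogated TSSs, three looping interactions
tss = ["t1", "t2", "t3", "t4"]
loops = [("t1", "d1"), ("t1", "d2"), ("t2", "d1")]
mean_all, mean_looping = degree_stats(tss, loops, side=0)
print(mean_all, mean_looping)
```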

Figure 4b shows an example of the complex long-range interaction 
networks formed by TSSs and distal fragments in the ENr132 region in 
K562 cells. It is unlikely that these interactions can all occur at the same 
time in the same cell, which is indicative of significant cell-to-cell 
variation. The data indicate that gene-element interactions are not 
exclusively one-to-one, and suggest that multiple genes and distal 
elements can assemble in larger clusters, as proposed for the β-globin 
locus. 

Our data provide new insights into the landscape of chromatin looping 
that brings genes and distant elements into close spatial proximity. In 
addition to generating a rich data set reflecting specific gene-element 
interactions, the average interaction profile of TSSs with surrounding 
chromatin reveals several general principles regarding the asymmetric 
relationships between genomic distance, the order of elements, and the 
formation of looping interactions. The bias for upstream interactions 
may indicate that the protein complexes on many TSSs may be 
asymmetric and may preferentially interact on one side with 
enhancer-protein complexes. It is also possible that the asymmetry 
of the long-range interaction landscape reflects a potential preference 
of looping to elements that are located in intergenic non-transcribed 
regions. Furthermore, although these average long-range interaction 
landscapes may facilitate computational prediction of long-range 
interactions throughout the genome, the fact that interactions skip 
genes and CTCF/cohesin sites indicates that additional mechanisms 
for target selection and gene insulation exist. 

Although conventional 3C may still be the method of choice to study 
the folding of individual loci, the 5C design strategy and data analysis 
methods applied here may provide a general approach for systematically 
Figure 4 | Networks of looping interactions. a, Histogram showing the 
number of TSSs (left, red) or distal fragments (middle, blue) in percentages that 
are involved in 0, 1, 2, ..., 10 (and above) looping interactions (degree, x axis) in 
GM12878 cells. All of the values for degrees that are >9 are grouped under 
degree 10+. The dark red bars represent the percentages of looping TSSs that 
are expressed whereas light red bars represent the percentages of looping TSSs 
that are not expressed. Inset: the difference in percentage between looping TSSs 
that are expressed and not expressed for each degree is shown. The right panel 
shows the degree distribution for each functional group of distal fragments. The 
average degrees (mean, μ) for TSSs and distal fragments are indicated. The first 
value is the mean degree considering all the TSS/distal fragments (looping plus 
non-looping), whereas the second value is the mean degree of looping TSS/ 
distal fragments (excluding degree = 0). b, Web plot showing the long-range 
looping interactions in the ENr132 region in K562 cells. The interrogated distal 
fragments (blue circles) and the TSSs (red circles) are positioned according to 
genomic coordinates and the GENCODE v7 gene annotation is indicated. The 
size of the red circles indicates whether that TSS is expressed (large circles) or 
not expressed (small circles). The thin grey lines show all the interactions that 
were interrogated. The coloured lines show significant looping interactions 
between TSSs and distal fragments of a particular group. 
mapping gene-element interactions for large gene sets. With further 
development of 3C technology and increases in sequencing capacity, 
similar high-resolution studies should become feasible to map specific 
long-range interactions throughout the genome, which may uncover 
additional principles that guide chromatin looping. Such insights will 
also be critical for interpreting genome-wide association studies that 
often identify regions with regulatory elements but not their distally 
located target genes. Co-published ENCODE-related papers can be 
explored online via the Nature ENCODE explorer 
(http://www.nature.com/ENCODE), a specially designed visualization tool that 
allows users to access the linked papers and investigate topics that 
are discussed in multiple papers via thematically organized threads. 


METHODS SUMMARY 


5C was performed using two pools of 5C primers: one for ENm001-ENm014 and 
ENr313, and one pool for all 30 randomly picked ENCODE regions (ENr111- 
ENr334) (Supplementary Tables 2 and 3). 5C libraries (two biological replicates 
per cell line) were sequenced on an Illumina GAIIx platform and sequence reads 
were mapped using Novoalign (http://www.novocraft.com), as described. 
Interaction data for each experiment are available in GEO (accession number 
GSE39510). Statistically significant pair-wise interactions were identified 
(Methods) by converting each 5C signal into a z-score using the average 5C signal 
distribution versus genomic distance as a background estimate. Significant inter- 
actions (1% FDR) observed in both biological replicates were considered looping 
interactions. 5C looping interactions were compared to a variety of genome-wide 
data sets generated by the ENCODE consortium (Supplementary Table 7). 
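
The logic of calling looping interactions (a distance-dependent background, a z-score per interaction, and a requirement for significance in both replicates) can be sketched as below. This is a simplified illustration and not the pipeline used here: fixed distance bins stand in for a smoothed background, a fixed z-score cutoff stands in for the 1% FDR threshold, and all names and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def distance_zscores(signals, bin_size=10_000):
    """signals: dict {(tss, distal): (distance_bp, signal)}.
    Background = mean and s.d. of all signals in the same distance bin."""
    bins = defaultdict(list)
    for dist, sig in signals.values():
        bins[dist // bin_size].append(sig)
    z = {}
    for pair, (dist, sig) in signals.items():
        values = bins[dist // bin_size]
        mu, sd = mean(values), pstdev(values)
        z[pair] = (sig - mu) / sd if sd > 0 else 0.0
    return z

def looping_interactions(rep1, rep2, z_cutoff=2.0):
    """Pairs exceeding the (assumed) z-score cutoff in both replicates."""
    z1, z2 = distance_zscores(rep1), distance_zscores(rep2)
    return {p for p in z1 if z1[p] > z_cutoff and z2.get(p, 0.0) > z_cutoff}

def toy_replicate(peak_signal):
    """Ten interactions at similar distances; one carries a strong signal."""
    rep = {("tss1", f"d{i}"): (50_000 + 100 * i, 1.0) for i in range(10)}
    rep[("tss1", "d3")] = (50_300, peak_signal)
    return rep

loops = looping_interactions(toy_replicate(11.0), toy_replicate(12.0))
print(loops)
```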


Full Methods and any associated references are available in the online version of 
the paper. 


Received 9 December 2011; accepted 1 June 2012. 


1. Dekker, J. Gene regulation in the third dimension. Science 319, 1793-1794 (2008). 

2. Dostie, J. et al. Chromosome conformation capture carbon copy (5C): A massively 
parallel solution for mapping interactions between genomic elements. Genome 
Res. 16, 1299-1309 (2006). 

3. ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements 
(ENCODE). PLoS Biol. 9, e1001046 (2011). 

4. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the 
human genome. Nature http://dx.doi.org/10.1038/nature11247 (this issue). 

5. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome 
conformation. Science 295, 1306-1311 (2002). 

6. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains 
uncovered by chromosome conformation capture-on-chip (4C). Nature Genet. 38, 
1348-1354 (2006). 

7. Zhao, Z. et al. Circular chromosome conformation capture (4C) uncovers extensive 
networks of epigenetically regulated intra- and interchromosomal interactions. 
Nature Genet. 38, 1341-1347 (2006). 

8. Miele, A. & Dekker, J. Long-range chromosomal interactions and gene regulation. 
Mol. Biosyst. 4, 1046-1057 (2008). 

9. Fullwood, M. J. et al. An oestrogen-receptor-a-bound human chromatin 
interactome. Nature 462, 58-64 (2009). 

10. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions 
reveals folding principles of the human genome. Science 326, 289-293 (2009). 

11. ENCODE Project Consortium. Identification and analysis of functional elements in 
1% of the human genome by the ENCODE pilot project. Nature 447, 799-816 
(2007). 


12. Gheldof, N., Tabuchi, T. M. & Dekker, J. The active FMR1 promoter is associated 
with a large domain of altered chromatin conformation with embedded local 
histone modifications. Proc. Natl Acad. Sci. USA 103, 12463-12468 (2006). 

13. Palstra, R. J. et al. The β-globin nuclear compartment in development and 
erythroid differentiation. Nature Genet. 35, 190-194 (2003). 

14. Tolhuis, B., Palstra, R. J., Splinter, E., Grosveld, F. & de Laat, W. Looping and 
interaction between hypersensitive sites in the active β-globin locus. Mol. Cell 10, 
1453-1465 (2002). 

15. Bau, D. et al. The three-dimensional folding of the α-globin gene domain reveals 
formation of chromatin globules. Nature Struct. Mol. Biol. 18, 107-114 (2011). 

16. Vernimmen, D., De Gobbi, M., Sloane-Stanley, J. A., Wood, W. G. & Higgs, D. R. 
Long-range chromosomal interactions regulate the timing of the transition between 
poised and active gene expression. EMBO J. 26, 2041-2051 (2007). 

17. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. 
Nature http://dx.doi.org/10.1038/nature11232 (this issue). 

18. Phillips, J. E. & Corces, V. G. CTCF: master weaver of the genome. Cell 137, 
1194-1211 (2009). 

19. Song, L. et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory 
elements that shape cell-type identity. Genome Res. 21, 1757-1767 (2011). 

20. Harrow, J. et al. GENCODE: The reference human genome annotation for the 
ENCODE project. Genome Res. http://dx.doi.org/10.1101/gr.135350.111 
(2012). 

21. Dong, X. et al. Correlating histone modifications and gene expression. Genome 
Biol. (in the press). 

22. Kim, T. K. et al. Widespread transcription at neuronal activity-regulated enhancers. 
Nature 465, 182-187 (2010). 

23. Djebali, S. et al. Landscape of transcription in human cell lines. Nature 
http://dx.doi.org/10.1038/nature11233 (this issue). 

24. Vavouri, T., McEwen, G. K., Woolfe, A., Gilks, W. R. & Elgar, G. Defining a genomic 
radius for long-range enhancer action: duplicated conserved non-coding 
elements hold the key. Trends Genet. 22, 5-10 (2006). 

25. Wallace, J. A. & Felsenfeld, G. We gather together: insulators and genome 
organization. Curr. Opin. Genet. Dev. 17, 400-407 (2007). 

26. Kurukuti, S. et al. CTCF binding at the H19 imprinting control region mediates 
maternally inherited higher-order chromatin conformation to restrict enhancer 
access to Igf2. Proc. Natl Acad. Sci. USA 103, 10684-10689 (2006). 

27. Wendt, K. S. et al. Cohesin mediates transcriptional insulation by CCCTC-binding 
factor. Nature 451, 796-801 (2008). 

28. Lajoie, B. R., van Berkum, N. L., Sanyal, A. & Dekker, J. My5C: web tools for 
chromosome conformation capture studies. Nature Methods 6, 690-691 (2009). 


Supplementary Information is available in the online version of the paper. 


Acknowledgements We thank the University of Massachusetts Medical School Deep 
Sequencing core for sequencing 5C libraries, and R. Thurman and 
J. Stamatoyannopoulos for discussion and help with peak calling analysis. We thank 
M. Walhout and members of the Dekker laboratory for discussions. This work was 
supported by grants from the National Institutes of Health, National Human Genome 
Research Institute (HG003143 and HG003143-06S1) and a W. M. Keck Foundation 
Distinguished Young Scholar in Medical Research award to J.D. 


Author Contributions J.D. conceived the project. A.S. performed all experiments. B.R.L. 
designed 5C experiments, and built the data analysis and visualization pipelines. B.R.L., 
A.S., G.J. and J.D. analysed the data and wrote the paper. 


Author Information All data are publicly available at GEO (accession number 
GSE39510). 5C data has also been deposited in the public UCSC ENCODE database 
(http://encodeproject.org/ENCODE/). 5C data can be found at http:// 
hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/ 
wgEncodeUmassDekker5C/. Reprints and permissions information is available at 
www.nature.com/reprints. This paper is distributed under the terms of the Creative 
Commons Attribution-Non-Commercial-Share Alike licence, and the online version of 
the paper is freely available to all readers. The authors declare no competing financial 
interests. Readers are welcome to comment on the online version of the paper. 
Correspondence and requests for materials should be addressed to J.D. 
(job.dekker@umassmed.edu). 


6 SEPTEMBER 2012 | VOL 489 | NATURE | 113 


©2012 Macmillan Publishers Limited. All rights reserved 


LETTER 


METHODS 


Cell growth conditions. GM12878 lymphoblastoid cells were procured from 
Coriell Cell Repositories and grown in RPMI 1640 medium supplemented with 
2 mM L-glutamine, 15% fetal bovine serum (FBS) and antibiotic (1% penicillin- 
streptomycin). K562 (CCL-243), a CML cell line, and HeLa-S3 (CCL2.2), a 
cervical carcinoma cell line, were obtained from American Type Culture 
Collection (ATCC). K562 cells were cultured in similar media as GM12878 
cells except with 10% FBS, whereas HeLa-S3 cells were maintained in ATCC 
recommended F-12K medium (Kaighn’s modification of Ham’s F-12 medium) 
with 10% FBS and 1% penicillin-streptomycin. The culture densities and condi- 
tions were maintained as per recommendations of the repositories. 
Formaldehyde crosslinking. For suspension cells (GM12878, K562) a total of 
1 X 10° freshly growing cells were centrifuged at 100g for 5 min. Cell pellets were 
re-suspended in 45 ml of respective growth medium in a 50-ml Falcon tube. Cells 
were fixed by addition of 1.25 ml of 37% formaldehyde (final concentration of 
formaldehyde 1%). The cell suspension was gently mixed by inverting the tube up 
and down 4-6 times at room temperature and the tubes were rotated on an end-to- 
end shaker for exactly 10 min. Crosslinking was stopped by addition of 2.5 M 
glycine (final concentration 125 mM) and cell suspensions were incubated at room 
temperature for 15 min using an end-to-end shaker. The crosslinked cells were 
then pelleted at 100g for 5 min and the cell pellet was stored at −80 °C. For 
HeLa-S3 cells, the adherent cells were first trypsinized and then the crosslinking 
was performed as described above. 
5C analysis. 5C analysis was carried out as previously described for the 44 
ENCODE Pilot regions (ENCODE manual (ENm) and ENCODE random 
(ENr)). The chromosomal position and coordinates of the regions as per the 
February 2009 GRCh37/hg19 human genome assembly are listed in 
Supplementary Table 1. The 5C experiment is designed to interrogate looping 
interactions between HindIII fragments containing transcription start sites (TSSs) 
and any other HindIII restriction fragment (distal fragments) in the ENCODE 
pilot regions. 
5C primer design. 5C primers were designed at HindIII restriction sites 
(AAGCTT) using 5C primer design tools previously developed and made available 
online at the My5C website (http://my5C.umassmed.edu). Reverse 5C primers were 
designed for HindIII restriction fragments overlapping a known TSS from 
GENCODE transcripts, or overlapping a start site as experimentally determined 
by CAGE tag data of the ENCODE pilot project (Supplementary Table 2). Forward 
5C primers were designed for the remaining HindIII restriction fragments 
(Supplementary Table 3). For ENCODE regions that do not contain any TSS 
according to gene annotation in 2008 (ENr112, ENr113, ENr311 and ENr313), 
we used an alternative primer design. For these regions an alternating design of 
forward and reverse 5C primers was used in which forward and reverse primers 
are designed for alternating restriction fragments. Note that ENr311 contains 
genes according to 2011 GENCODE v7 annotation. Primers were excluded for 
highly repetitive sequences that prevented the design of a sufficiently unique 5C 
primer. Primer settings were as described before: U-BLAST, 3; S-BLAST, 130; 
15-MER, 1,320; MIN_FSIZE, 40; MAX_FSIZE, 50,000; OPT_TM, 65; 
OPT_PSIZE, 40. The 5C primers contained up to 40 bases that were specific for 
the corresponding restriction fragment. If a shorter sequence was sufficient to 
obtain a predicted annealing temperature of 65 °C, that shorter sequence was used, 
and random sequence was added to make a total of 40 bases. All of the 5C primers 
have an extension of universal tail sequences at the 5′ end for forward 5C primers 
and at the 3′ end for reverse 5C primers. DNA sequence of the universal tails of 
forward primers was 5′-CCTCTCTATGGGCAGTCGGTGAT-3′; DNA 
sequence for the universal tails of reverse primers was 
5′-AGAGAATGAGGAACCCGGGGCAG-3′. A six-base barcode was included between the specific 
sequence of the primers and the universal tail to aid in mapping of the high- 
throughput short sequencing reads. The length of each primer was 69 bp. In total, 
981 reverse primers and 5,321 forward primers were designed (corresponding to 
~77.1% (6,302 of 8,174) of all HindIII fragments in the 44 ENCODE regions). 
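
The primer layout and the figures quoted above can be checked with a short sketch. The tail sequences, the 40-base specific region, the six-base barcode and the primer counts are taken from the text; the function name, the placeholder specific sequence and the placeholder barcode are hypothetical.

```python
FORWARD_TAIL = "CCTCTCTATGGGCAGTCGGTGAT"  # universal tail, 5' end of forward primers
REVERSE_TAIL = "AGAGAATGAGGAACCCGGGGCAG"  # universal tail, 3' end of reverse primers

def build_primer(specific, barcode, kind):
    """Assemble a 5C primer (written 5'->3') with the barcode placed
    between the fragment-specific sequence and the universal tail."""
    if len(specific) != 40 or len(barcode) != 6:
        raise ValueError("expected a 40-base specific part and a 6-base barcode")
    if kind == "forward":
        return FORWARD_TAIL + barcode + specific
    if kind == "reverse":
        return specific + barcode + REVERSE_TAIL
    raise ValueError("kind must be 'forward' or 'reverse'")

# Placeholder sequences for illustration only (not real primers)
specific = "ACGT" * 10
barcode = "AAAAAA"
forward = build_primer(specific, barcode, "forward")
reverse = build_primer(specific, barcode, "reverse")
print(len(forward), len(reverse))  # 23 + 6 + 40 = 69 for both

# Fraction of HindIII fragments covered by a primer
n_reverse, n_forward, n_fragments = 981, 5_321, 8_174
coverage = (n_reverse + n_forward) / n_fragments
print(f"{100 * coverage:.1f}%")
```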
Generation of 5C libraries. 3C was performed with HindIII restriction enzyme as 
previously described for GM12878, K562 and HeLa-S3 cells separately with 
two biological replicates for each cell line. The 3C libraries were then interrogated 
by 5C. The 44 ENCODE regions were analysed in two groups using two separate 
5C primer pools. The first group (ENm) contained the manually picked ENCODE 
regions ENm001-ENm014 and ENr313. The second group (ENr) contained the 
30 randomly picked ENCODE regions. The two 5C primer pools were made by 
pooling 5C primers for interrogating long-range interactions in the two groups of 
ENCODE regions. In these pools each primer was present at a final concentration 
of 0.5 fmol µl⁻¹. 

The primer pool for the ENm group contained a total of 3,150 primers (476 
reverse 5C primers and 2,674 forward 5C primers). This primer pool allows 
interrogation of a total of 1,272,824 interactions. Of these, 83,427 interactions 
were between fragments that were both located in the same ENCODE region. 
The primer pool for the ENr group contained a total of 3,152 primers (505 reverse 
5C primers and 2,647 forward 5C primers). This primer pool allows interrogation 
of a total of 1,336,735 interactions. Of these, 34,859 interactions were between 
fragments that were both located in the same ENCODE region. 
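
The totals quoted for the two primer pools are consistent with every reverse primer being paired with every forward primer in the same pool. The following sketch makes that arithmetic explicit; the function name is hypothetical and the all-against-all pairing is an assumption inferred from the numbers.

```python
def interrogated_pairs(n_reverse, n_forward):
    """Number of reverse-forward primer combinations in one pool,
    assuming each reverse primer can pair with each forward primer."""
    return n_reverse * n_forward

pools = {
    "ENm": {"reverse": 476, "forward": 2_674, "same_region": 83_427},
    "ENr": {"reverse": 505, "forward": 2_647, "same_region": 34_859},
}

totals = {}
for name, pool in pools.items():
    totals[name] = interrogated_pairs(pool["reverse"], pool["forward"])
    within = pool["same_region"] / totals[name]
    print(f"{name}: {totals[name]:,} pairs, {100 * within:.2f}% within one region")
```

Only a few per cent of the interrogated pairs lie within a single ENCODE region; the remainder connect different regions.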

5C was performed in 10-15 reactions each containing an amount of 3C library 
that represents 200,000 genome equivalents and 0.5 fmol of each primer. The 
multiplex annealing reaction was performed overnight at 55 °C. Pairs of annealed 
5C primers were ligated at the same temperature using Taq DNA ligase for 1h. 
Ligated 5C primer pairs, which represent a specific ligation junction in the 3C 
library and thus a long-range interaction between the two corresponding loci, were 
then amplified using 28 cycles of PCR with universal tail primers that recognize the 
common tails of the 5C forward and reverse primers. At least four separate amp- 
lification reactions were carried out for each of 10-15 annealing reactions 
described above and all the PCR products were pooled together. This pool con- 
stitutes the 5C library. The libraries were concentrated using Qiaquick PCR puri- 
fication kit and a 3’-A tailing reaction was done using dATP and Taq DNA 
polymerase in the presence of 1X standard Taq buffer (NEB) at 72 °C for 30 min. 

To facilitate Illumina paired-end DNA sequence analysis of 5C libraries, 
Illumina paired-end adaptor oligonucleotides (Illumina) were ligated to the 5C 
library using the Illumina PE protocol. The linkered 5C library was then amplified 
by PCR (17 or 18 cycles, with Phusion High Fidelity DNA polymerase) using 
Illumina PCR primer PE 1.0 and 2.0. The 5C library was gel purified and 
sequenced on the Illumina GAIIx platform, generating 36-bp paired-end reads. 
5C read mapping. Sequencing data was obtained from an Illumina GAIIx 
machine and was processed through a custom pipeline to map and assemble 5C 
interactions. We used 36-bp paired-end reads to sequence all 5C libraries. Owing 
to sequencing efficiency, some 5C libraries were re-sequenced as many as ten times 
to obtain the required read depth for our analysis. 

The FASTQ files were taken directly from the Illumina GAIIx and fed into our 
in-house 5C mapping pipeline. Each side of the paired end read was independently 
mapped to a pseudo-genome of all possible 5C primer sequences using the 
novoalign mapping algorithm (V2.05 http://novocraft.com). The default align- 
ment settings for novoalign were used. After mapping, if both of the paired-end 
reads could be uniquely mapped to a 5C primer, a 5C interaction was assembled. 
Invalid interactions between the same primer or between primers of the same type 
were removed as these would represent a mapping artefact or an issue with the 5C 
technique. The number of invalid interactions detected across all libraries was 
<0.01%, which would be expected if solely due to random mapping errors. 
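
The assembly step described above can be sketched as follows, assuming each read has already been assigned to a primer (or to none). The function name, primer names and toy read pairs are hypothetical; the rule that pairs of the same primer or of the same primer type are invalid follows the text.

```python
from collections import Counter

def assemble_interactions(read_pairs, primer_type):
    """read_pairs: iterable of (primer_a, primer_b); None marks a read that
    did not map uniquely. primer_type: dict primer name -> 'F' or 'R'.
    Returns (Counter of valid (forward, reverse) pairs, n_invalid, n_unmapped)."""
    counts, invalid, unmapped = Counter(), 0, 0
    for a, b in read_pairs:
        if a is None or b is None:
            unmapped += 1
        elif primer_type[a] == primer_type[b]:
            # Same primer twice, or two primers of the same type
            invalid += 1
        else:
            fwd, rev = (a, b) if primer_type[a] == "F" else (b, a)
            counts[(fwd, rev)] += 1
    return counts, invalid, unmapped

primer_type = {"F1": "F", "F2": "F", "R1": "R"}
read_pairs = [("F1", "R1"), ("R1", "F1"), ("F1", "F2"), ("R1", "R1"), ("F2", None)]
counts, invalid, unmapped = assemble_interactions(read_pairs, primer_type)
print(dict(counts), invalid, unmapped)
```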

Statistics regarding the 5C library quality, mapping efficiency, etc. can be found 
in Supplementary Table 4. Because it is only necessary to map the paired-end reads 
to the list of all possible 5C primers rather than to the entire genome, a higher 
percentage of mapped/usable reads can be achieved. We found that >90% of all 
paired-end reads (after Illumina chastity filtering) can be uniquely mapped to a 
single 5C interaction. For libraries where more than one lane was used to achieve 
adequate sequence depth, the interactions from each lane were summed to pro- 
duce the complete 5C interaction data set. A table summarizing the read depth of 
each 5C library can be found in Supplementary Table 5. Pearson correlation 
coefficients between the biological replicates can be found in Supplementary 
Table 6. 
Detection bias correction. 5C experiments involve a number of steps that can 
locally differ in efficiency, thereby introducing biases in efficiency of detection of 
pairs of interactions. These biases could be due to differences in the efficiency of 
crosslinking, the efficiency of restriction digestion (related to crosslinking effi- 
ciency), the efficiency of ligation (related to fragment size), the efficiency of 5C 
primers (related to annealing and PCR amplification) and finally the efficiency of 
DNA sequencing (related to base composition). All of these potential biases— 
several of which are common to other approaches such as chromatin immuno- 
precipitation (for example, crosslinking efficiency, PCR amplification, base-com- 
position-dependent sequencing efficiency)—will have an impact on the overall 
efficiency with which long-range interactions for a given locus (restriction frag- 
ment) can be detected. To determine this overall efficiency of interaction detection 
we have developed the following general strategy. To determine overall interaction 
detection efficiency for a given restriction fragment we analysed the large set of 
interchromosomal interactions that are detected for each fragment. We then 
defined the overall efficiency of interchromosomal interaction detection for a 
given fragment as the ratio of the average interchromosomal signal obtained with 
that fragment and the average interchromosomal signal of all fragments. We then 
corrected the frequency of each interrogated long-range intrachromosomal inter- 
action using a correction factor that is the product of the overall efficiency of 
interchromosomal interaction detection for the two interacting fragments. 

This procedure will correct for any of the biases in detectability of interactions 
for a given locus, as listed above, and will also adjust for copy number variation of a 
locus, which can vary in transformed cell lines such as K562 and HeLa-S3 cells, as 
these factors will also affect the level of interchromosomal interactions. 
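
The correction strategy can be sketched as below. Two points are assumptions rather than statements from the text: the global average is taken as the mean of the per-fragment means, and the raw signal is divided by the product of the two efficiencies. Function names and toy values are hypothetical.

```python
from statistics import mean

def detection_efficiency(trans_signals):
    """trans_signals: dict fragment -> list of interchromosomal signals.
    Efficiency = fragment's mean trans signal / mean over all fragments
    (assumption: the mean of the per-fragment means is used)."""
    frag_mean = {f: mean(v) for f, v in trans_signals.items()}
    global_mean = mean(list(frag_mean.values()))
    return {f: m / global_mean for f, m in frag_mean.items()}

def correct_signal(raw, eff_a, eff_b):
    """Corrected intrachromosomal signal for a pair of fragments
    (assumption: raw signal divided by the product of efficiencies)."""
    return raw / (eff_a * eff_b)

# Toy trans signals for three fragments
trans = {"A": [3.0, 3.0], "B": [2.0, 2.0], "C": [1.0, 1.0]}
eff = detection_efficiency(trans)
corrected = correct_signal(30.0, eff["A"], eff["C"])
print(eff, corrected)
```

A fragment that is detected more efficiently than average has its signals scaled down, and a poorly detected fragment has them scaled up.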
Detailed primer filtering. To approximate the relative 5C signal of each restric- 
tion fragment interrogated in the experiment we first calculated the average 5C 
signal for all trans interactions (interactions between different chromosomes). To 
remove any extreme outliers from the mean calculation (for example, due to 
primer failure) we first filtered down the distribution of 5C signals in trans for 
each restriction fragment by removing all signals beyond the mean + 3 standard 
deviations (s.d.). After calculating the filtered mean for each restriction fragment 
in trans, we calculated the global mean of all interchromosomal interaction 
frequencies. We then calculated a correction factor for each restriction fragment 
that would normalize its set of trans interactions to the entire set. Once the 
correction factors were calculated, we then calculated the mean and s.d. correction 
factor and flagged any restriction fragments requiring a correction value beyond 
the mean + 1.654 s.d. Fragments with a correction factor outside this limit were
flagged for removal because their trans signal lies too far above or below the signal
expected by chance. Here, we assume that any variation in 5C signals detected within the
trans space is due to experimental factors, differing primer efficiencies, ligation 
efficiencies, etc. 
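
The filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the analysis: the function names are hypothetical, the population s.d. is used, the global mean is taken as the mean of the per-fragment filtered means, each correction factor is defined as global mean divided by fragment mean, and the 1.654 s.d. limit is read as two-sided. All trans means are assumed to be positive.

```python
from statistics import mean, pstdev


def filtered_mean(signals, n_sd=3.0):
    """Mean of one fragment's trans signals after dropping values
    beyond mean + n_sd * s.d. (for example, from primer failure)."""
    m, s = mean(signals), pstdev(signals)
    kept = [x for x in signals if x <= m + n_sd * s]
    return mean(kept)


def correction_factors(trans_signals):
    """trans_signals maps fragment id -> list of trans 5C signals.
    Assumption: factor = global mean / fragment mean, so multiplying
    a fragment's signals by its factor brings its mean to the global mean."""
    frag_means = {f: filtered_mean(v) for f, v in trans_signals.items()}
    global_mean = mean(list(frag_means.values()))
    return {f: global_mean / m for f, m in frag_means.items()}


def flag_outliers(factors, n_sd=1.654):
    """Fragments whose factor lies more than n_sd s.d. from the mean factor.
    Assumption: the limit is applied on both sides of the mean."""
    values = list(factors.values())
    m, s = mean(values), pstdev(values)
    return {f for f, c in factors.items() if abs(c - m) > n_sd * s}
```

Flagged fragments would be removed before the factors are recalculated on the remaining fragments.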
Detailed primer correction. Once the outlier fragments are removed from the 5C 
data set, we repeated the above-described steps to calculate the primer correction 
values required to normalize the 5C signals for the remaining restriction fragments. 
Then, for each 5C interaction within an ENCODE region in the data set, we used 
the product of the correction factors from the two restriction fragments involved
in the interaction as the final correction factor to apply to the 5C signal. 5C signals
were then either increased or decreased by the correction factor to correct for
the varying visibility of the fragments in the trans interaction space.
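
As a small worked illustration of this step, the sketch below applies the product of two per-fragment factors to one interaction. The function name and fragment ids are hypothetical, and it assumes the convention that each factor equals global trans mean divided by fragment trans mean, so the signal is multiplied (with the inverse convention one would divide instead).

```python
def correct_signal(signal, frag_a, frag_b, factors):
    """Scale one 5C signal by the product of the correction factors
    of the two restriction fragments involved in the interaction."""
    return signal * factors[frag_a] * factors[frag_b]


# Hypothetical factors: F1 is under-detected in trans, R1 over-detected.
example_factors = {"F1": 2.0, "R1": 0.5, "R2": 1.5}
corrected = correct_signal(10.0, "F1", "R2", example_factors)
```

A product above 1 increases the signal and a product below 1 decreases it, matching the description above.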
Peak calling. To detect significant looping interactions from background looping 
interactions we developed an in-house ‘5C peak calling’ algorithm. We chose to 
call peaks in each 5C biological replicate separately and then take only the peaks 
that intersect across replicates as our final list of significant looping interactions. 
5C signals represent the three-dimensional contact probabilities between 
pairs of loci. These probabilities scale inversely with genomic distance. To control
properly for the varying genomic distances tested in the 5C data set, we first 
determined the relationship of 5C signals over genomic distance. Using a 
Lowess smoothing algorithm we found the weighted average and weighted s.d. 
of all 5C signals across the range of all interrogated genomic distances. We used the 
traditional tri-cubic weighting function and an α parameter of 0.01 to average
the closest 1% of the 5C signals around each genomic distance. We assumed that 
the large majority of interactions are not significant looping interactions and thus 
we interpreted this weighted average as the expected 5C signal for any given 
genomic distance. The 5C signals were then transformed into a z-score by calculating
(obs − exp)/s.d., where obs is the detected 5C signal for a
specific interaction, exp is the calculated weighted average of 5C signals for a
specific genomic distance, and s.d. is the calculated weighted standard deviation
of 5C signals for a specific genomic distance. Once the z-scores were calculated, the
distribution of z-scores was fit to a Weibull distribution. We found that the 
distribution of z-scores fits to the Weibull distribution with an R² value of
>0.939 for all cell lines. P values can then be mapped to each z-score and then 
also transformed into q values for FDR analysis. The ‘q value’ package from R 
(qvalue.cal [siggenes]) was used to compute the q values for the given set of 
P values determined from the fit to the Weibull distribution. Using an FDR cutoff 
of 1%, we selected all 5C interactions with a q value <0.01. We then took the 
intersection of all significant looping interactions across the two biological repli- 
cates as our final list of 5C looping interactions. 
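
The distance-dependent expectation and z-score can be sketched as below. This is a simplified illustration, not the in-house peak caller: the function names are hypothetical, the expectation is a tri-cube weighted mean over the nearest fraction α of data points (a local average rather than a full local regression), and the Weibull fit and q value steps are omitted.

```python
import math


def tricube(u):
    """Tri-cube weight for a scaled distance u in [0, 1]."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_expectation(distances, signals, d0, alpha=0.01):
    """Weighted mean and weighted s.d. of the 5C signals whose genomic
    distances are the closest fraction alpha to d0."""
    n = len(distances)
    k = max(2, math.ceil(alpha * n))
    nearest = sorted(range(n), key=lambda i: abs(distances[i] - d0))[:k]
    h = max(abs(distances[i] - d0) for i in nearest) or 1.0
    w = [tricube(abs(distances[i] - d0) / h) for i in nearest]
    total = sum(w)
    if total == 0:  # all neighbours sit at the window edge: weight equally
        w = [1.0] * len(nearest)
        total = float(len(nearest))
    exp = sum(wi * signals[i] for wi, i in zip(w, nearest)) / total
    var = sum(wi * (signals[i] - exp) ** 2 for wi, i in zip(w, nearest)) / total
    return exp, math.sqrt(var)


def z_score(obs, exp, sd):
    """z = (obs - exp) / s.d."""
    return (obs - exp) / sd
```

In this sketch a caller would compute `local_expectation` at the genomic distance of each interaction and pass the result to `z_score`; a zero s.d. would need separate handling.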
Estimation of frequency of false-positive looping interactions. We defined a 
false-positive 5C looping interaction as an interaction that is identified in the peak 
calling approach described above but is due to technical biases or noise and thus 
does not reflect a biologically meaningful long-range interaction. To estimate the 
frequency by which our approach detects significant looping interactions by 
chance, we analysed 5C data obtained for the three ENCODE regions that are
devoid of genes and are almost devoid of active regulatory elements (according to the
ENCODE seven-way segmentation’). As described above, we used an alternating 
5C primer design for these regions. As a result, long-range interaction profiles are 
not specifically anchored on any type of genomic element. Combined with the fact 
that these regions are largely devoid of any functional elements, we do not expect 
to detect any significant looping interactions. Thus, assessment of the number of 
looping interactions detected for these regions using our peak-calling pipeline 
provides an empirical approach to estimate the frequency by which significant 
looping interactions are detected by chance and thus represent false positives. 

Supplementary Fig. 1a shows the number of peaks detected in the three gene
desert ENCODE regions (ENr112, ENr113 and ENr313). We used these numbers 
to estimate the frequency with which we detect significant looping interactions by 
chance. For GM12878 cells we identified 17 significant looping interactions in 
both replicates. For these three ENCODE regions we interrogated 7,819 5C interactions.
Thus, we estimate the fraction of interrogated interactions that by
chance score as a significant long-range interaction to be (17/7,819) × 100 = 0.217%.
Assuming that this fraction is the same for the set of 82,545 interrogated TSS- 
distal element interactions throughout the ENCODE regions, we expect to detect 
(0.217 × 82,545)/100 = 179 false-positive looping interactions. We detected 1,011
significant looping interactions between TSSs and distal sites in GM12878 cells, 
which leads us to estimate that the false-positive detection rate is around 18% 
[(179/1,011) × 100]. Similar analyses of 5C data from K562 and HeLa-S3 cells lead to
estimates of false-positive detection rates of 10% and 12%, respectively, corres- 
ponding to 147 out of 1,434 and 190 out of 1,620 looping interactions possibly 
being false positives. We note that these represent upper limit estimates, as some of 
the significant looping interactions detected in the gene desert regions may be real. 
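
The arithmetic for GM12878 can be reproduced with a short script. The function and variable names are illustrative only; the four input numbers are those given above.

```python
def false_positive_estimate(desert_peaks, desert_tested, tested, detected):
    """Estimate the false-positive detection rate from gene desert regions.

    desert_peaks  : significant interactions called in the gene deserts
    desert_tested : interactions interrogated in the gene deserts
    tested        : interrogated TSS-distal element interactions
    detected      : significant TSS-distal element interactions
    """
    chance_rate = desert_peaks / desert_tested
    expected_false = chance_rate * tested
    return chance_rate, expected_false, expected_false / detected


rate, expected, fp_rate = false_positive_estimate(17, 7819, 82545, 1011)
print(f"chance rate {rate:.3%}, expected false positives {expected:.0f}, "
      f"false-positive detection rate {fp_rate:.0%}")
```

Because the script carries the unrounded chance rate through the calculation, the intermediate value differs slightly from the rounded 0.217% used in the text, but it rounds to the same 179 interactions and 18%.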

The false-positive detection rate for single replicates can be calculated in exactly 
the same way. We found that the fraction of significant looping interactions 
detected in one replicate that might be false positives ranges from 20% to 47%. 
Thus, by requiring interactions to be significant in both replicates, we greatly 
reduce the fraction of false-positive significant interactions (from 20-47% to 
10-18% of the significant interactions). At the same time, many of the significant 
interactions detected in only one replicate were not false positives, and by exclud- 
ing this subset of interactions from our analysis we introduce false negatives. 
Consistent with our interpretation that many of the peaks seen in only one rep- 
licate represent false negatives, we found that when we take the union of the peaks 
found in replicates 1 and 2, or analyse the set of peaks obtained with individual 
replicates separately, all of the results that we presented remain the same: (1) 
enrichment for distal elements that resemble active gene regulatory elements 
(Supplementary Fig. 1e); (2) asymmetry of the long-range interaction landscape
with a peak around 120 kb upstream of the TSS (Supplementary Fig. 8); (3)
skipping over CTCF sites; and (4) formation of interwoven interaction networks. 
The fact that all our results can be obtained using different peak sets (for example, 
the union of two replicates, or the intersection of the replicates) indicates that our 
basic findings are robust and not very sensitive to where the threshold for peaks is 
placed. By focusing exclusively on the set of peaks independently detected in both
replicates we are being conservative: we report only the strongest signals, which display
the strongest enrichments for active chromatin features (Supplementary Fig. 1),
and we reduce the false-positive rate.

In general we prefer false negatives over false positives. 
Fragment annotation. To annotate the interrogated restriction fragments, a 
variety of ENCODE data sets were used to check for overlap with our list of 
restriction fragments. A list of all used ENCODE data sets can be found in 
Supplementary Table 7. 
Supplementary data. A zip archive containing all Supplementary Data can be 
found in Supplementary Information. 


29. Dostie, J. & Dekker, J. Mapping networks of physical interactions between 
Structure of a RING E3 ligase and
ubiquitin-loaded E2 primed for catalysis 


Anna Plechanovová¹, Ellis G. Jaffray¹, Michael H. Tatham¹, James H. Naismith² & Ronald T. Hay¹


Ubiquitin modification is mediated by a large family of specificity determining ubiquitin E3 ligases. To facilitate ubiquitin 
transfer, RING E3 ligases bind both substrate and a ubiquitin E2 conjugating enzyme linked to ubiquitin via a thioester 
bond, but the mechanism of transfer has remained elusive. Here we report the crystal structure of the dimeric RING 
domain of rat RNF4 in complex with E2 (UbcH5A) linked by an isopeptide bond to ubiquitin. While the E2 contacts a 
single protomer of the RING, ubiquitin is folded back onto the E2 by contacts from both RING protomers. The 
carboxy-terminal tail of ubiquitin is locked into an active site groove on the E2 by an intricate network of 
interactions, resulting in changes at the E2 active site. This arrangement is primed for catalysis as it can deprotonate 
the incoming substrate lysine residue and stabilize the consequent tetrahedral transition-state intermediate. 


By altering the fate of modified proteins, conjugation with ubiquitin 
and its homologues has a central role in eukaryotic biology under- 
pinning cell signalling, protein degradation and stress responses. In 
most cases ubiquitin is transferred to its target proteins from a thioe- 
ster complex with a ubiquitin conjugating enzyme (E2) by a large 
family of ubiquitin E3 ligases (E3)'. The RING family of E3s, of which 
over 600 are encoded in the human genome, possess a conserved 
arrangement of cysteine and histidine residues that coordinate two 
zinc atoms’. RING E3 ligases bind both substrate and E2-ubiquitin 
(E2-Ub) thioester, but the molecular basis by which the RING acti- 
vates the E2-Ub bond for transfer of ubiquitin to substrate has 
remained elusive. 

RNF4 is a SUMO-targeted ubiquitin ligase’ that has a key role in the
DNA damage response*® and in arsenic therapy for acute promyelo- 
cytic leukaemia”*. RNF4 contains multiple SUMO interaction motifs, 
allowing it to engage polySUMO-modified substrates, and a RING
domain’ that is responsible for dimerization and catalysis of ubiquitin 
transfer’’. Our understanding of RING-catalysed ubiquitination has 
been hindered by the lack of structures of the key intermediate: a 
RING bound to E2-Ub. Obtaining this key complex is difficult, as 
the thioester (or engineered oxyester) bond linking E2 and ubiquitin is 
highly activated and unstable in the presence of an E3. 


Structure of the RING-UbcH5A-Ub complex 

We have engineered a mimic of the E2-Ub thioester bond by replacing 
the active site cysteine of the E2 UbcH5A (also called UBE2D1) with a 
lysine to generate an isopeptide (amide) bond between the C terminus 
of ubiquitin and the ε-amino group of the introduced lysine
(Supplementary Figs 1 and 2). Isopeptide-linked UbcH5A-Ub bound 
selectively to the RNF4 RING and acted as a potent inhibitor of RNF4- 
mediated substrate ubiquitination, confirming that it is an excellent 
mimic, but crucially, that it is stable in the presence of RNF4 
(Supplementary Fig. 3). The E2-Ub mimic was mixed in a 2:1 ratio 
with a fused RNF4 RING dimer’ and crystallized. A 2.2 Å structure of
the resulting complex was determined (Supplementary Table 1). The 
asymmetric unit contains the central RNF4 RING dimer, two UbcH5A 
molecules and two ubiquitin molecules related by a two-fold axis 
(Fig. 1). Each UbcH5A molecule contacts a single RING domain and 


is linked by an isopeptide bond to ubiquitin (Supplementary Fig. 4) that 
sits at the RING dimer interface. The complex can be envisaged as a 
dimer of heterotrimers (RING monomer, UbcH5A and ubiquitin). 

Strikingly, ubiquitin is folded back onto the E2, creating an interface
that buries approximately 1,800 Å², has 15 hydrogen bonds and 4
salt bridges. L8 of ubiquitin interacts with L97 and K101 of UbcH5A,
whereas I44, H68 and V70 in ubiquitin are close to L104, S105 and
S108 on the α2 helix of the E2 (Fig. 2a). Extensive contacts are evident
between the C-terminal 6 residues of ubiquitin and loops surrounding 
the active site of UbcH5A, particularly residues L86, D87, Q92 and 
N114. The side chain of N77 in UbcH5A forms a hydrogen bond to 
the isopeptide carbonyl (Fig. 2b). Mapping conserved E2 residues 
(Supplementary Fig. 5) shows that highly conserved residues sur- 
round the active site and the shallow groove that accommodates the 
C-terminal region of the linked ubiquitin (Supplementary Fig. 6). The 
other conserved cluster of E2 residues constitutes the binding site for 
the E3 ligase. 

UbcH5A contacts a single protomer of the RING (Supplementary
Fig. 7) and the interface is very similar to that previously described for 
RING-E2 complexes’®"". At the junction of the three molecules in the 
heterotrimer is a hydrophobic cluster formed by L8, T9 and L71 of 
ubiquitin, A96 and L97 of UbcH5A, and P137, P178 and R181 of the 
RING (Fig. 2c and Supplementary Fig. 7). Ubiquitin contacts both 
protomers of the RING dimer and the interface buries 940 Å²
(Fig. 2c). Residues L8 to K11 and L71 with R72 of ubiquitin contact 
RING residues T179 to R181 within the same heterotrimer, whereas 
the Q31 to Q40 region of ubiquitin contacts both protomers of the 
RING dimer. The backbone carbonyl of ubiquitin E34 makes a hydro- 
gen bond with RING residue H160 (zinc ligand) and the main-chain 
E34 to G35 of ubiquitin stacks with the side chain of Y193 of the RING 
domain from the other heterotrimer (Fig. 2d). These interfaces 
explain why dimerization of the RNF4 RING is required for activity*”. 
Phylogenetic analysis of RNF4 from a wide range of species and 
sequence comparison of RNF4 with other dimeric RING and U-box 
E3 ligases indicate that the bound ubiquitin interacts with conserved 
features of the RING (Supplementary Fig. 8). 

The RING domain does not undergo any major structural change 
as a result of complex formation (Supplementary Fig. 9a). Ubiquitin 
Figure 1 | Structure of the RNF4 RING bound to ubiquitin-loaded
UbcH5A. a, Surface representation of the complex. Individual RING
protomers are coloured cyan and blue, UbcH5A is green, ubiquitin is orange
and the isopeptide linkage between the C terminus of ubiquitin and K85 of
UbcH5A is shown in yellow. b, Ribbon diagram of the complex with the same
orientation and colour scheme as in a. Zinc atoms are indicated as grey spheres. 
c, As in b, but the complex is rotated by 90° as indicated. 


shows little change in overall structure up to R72; the remaining five 
residues are, however, positioned differently as a consequence of 
being held in the active site groove of UbcH5A. The loop at L8 in 
ubiquitin has moved over 4 Å to form the hydrophobic cluster with
UbcH5A and the RING (Supplementary Fig. 9b). Superposition of the
coordinates of unconjugated UbcH5A either free (Protein Data Bank
accession 2C4P), or in a variety of non-covalent complexes'*"*, and 
UbcH5A in the present structure reveals a clear re-arrangement 
centred on D117. In the unconjugated structures, the side chain 
of D117 points towards C85, in a position that would clash with the 
isopeptide (thioester) bond observed in our complex (Supplementary 
Fig. 9c, d). 


E2 and ubiquitin residues required for activity 

Previous mutational analysis revealed the importance of the RING 
residue R181—which contacts both E2 and ubiquitin in the present 
structure (Fig. 2)—in the ubiquitination activity of RNF4 (ref. 3). 
Moreover, Y193 in the RING plus L8 and I44 in ubiquitin were shown
to be required for activity’. Although it was thought that these resi- 
dues might interact directly, the present structure emphasizes their 
importance but shows that they are not in direct contact. To validate 
our structure further, we introduced mutations into ubiquitin and 
UbcH5A (Fig. 3a, b) and tested these in a single-turnover substrate
ubiquitination assay. Mutations of hydrophobic residues I44 (ubiquitin)
and L104 (UbcH5A) at the interface between ubiquitin and
UbcH5A abolished ubiquitination activity, whereas mutations 
K101A, S108A and D112A in UbcH5A and R42A in ubiquitin 
reduced activity modestly (Fig. 3c, d and Supplementary Figs 10-13).
Ubiquitin mutations G35A and I36A (both at the RING interface)
substantially (>10×) reduced activity. Significant reductions in
ubiquitination were also observed for mutations of L8 and L71 in 
ubiquitin and L97 in UbcH5A that form a hydrophobic core at the 
junction of all the three molecules in the heterotrimer. In the E2 active 
site groove, mutations N77A and D87A in UbcH5A abolished activity,
whereas D117A severely compromised activity. N114A in
UbcH5A and R72A, L73A and R74A in ubiquitin displayed modestly 
reduced activity (Fig. 3c, d and Supplementary Figs 10-13). 

To discriminate between residues in ubiquitin and E2 that influence 
the ability of the substrate lysine to carry out nucleophilic attack on the 
E2-Ub thioester and those residues involved in activating the E2-Ub 
bond, we carried out substrate-independent assays that measure the 
ability of the RNF4 RING to catalyse hydrolysis of an E2-Ub oxyester 
bond’ (Fig. 3e, f and Supplementary Figs 14 and 15). Mutations in 
ubiquitin and UbcH5A that reduced substrate-dependent ubiquitina- 
tion also reduced oxyester hydrolysis, with the important exception of 
D117A, which was defective in substrate ubiquitination but retained 
wild-type levels of oxyester hydrolysis (Fig. 3d, f). 

We investigated whether residues in ubiquitin and UbcH5A that 
are important for RNF4-mediated ubiquitination have a more general 
role in E3-catalysed transfer. The ubiquitin and UbcH5A mutants 
were tested in combination with the unrelated U-box E3 ligase 
CHIP (C terminus of Hsp70-interacting protein) using an autoubi- 
quitination assay. Although there are relatively modest quantitative 
differences in ubiquitination, the effect of the mutations on CHIP 
and RNF4 activity is very similar (Fig. 4). Thus, it is likely that 
a conserved mechanism is used by CHIP and RNF4 to catalyse 
ubiquitin transfer. 


Mechanism of RING-mediated ubiquitination 


Using the isopeptide-linked E2-Ub in our crystal structure, we con- 
structed a model of the E2-Ub thioester by replacing K85 in UbcH5A 
with a cysteine and minimizing the geometry. The resulting model 
shows very minor changes: the Sγ and Cα atoms in C85 are shifted
1.0 Å and 0.2 Å from the Cγ and Cα atoms of K85, with smaller changes in
I84 and L86. In ubiquitin the Cα atoms of G76 and G75 have moved
0.5 Å and 0.2 Å, respectively. The carbonyl group of the thioester at
G76 has moved 0.6 Å and rotated around 45°, resulting in the hydrogen
bond with N77 being extended to 3.6 Å (Fig. 5a, b). Coupled with
the mutational analysis and evidence that the isopeptide-linked E2- 
Ub is a competitive inhibitor of ubiquitination, we conclude that the 
crystal structure is a relevant model for the key E2-Ub-RING hetero- 
trimeric intermediate. 

In the absence of an E3 ligase, the ubiquitin thioester linked to the 
E2 can adopt a wide range of different conformations that also include 
a ‘folded-back’ conformation’*”’”. As free ubiquitin has no detectable 
affinity for the RNF4 RING we suggest that the initial interaction will 
be between E2 and the RING. In this encounter, with the E2 bound to 
one RING protomer, the thioester-linked ubiquitin would be engaged 
by Y193 of the other RING protomer and folded back to contact the 
α2 helix of UbcH5A, while its C terminus is extended and locked in
the active site groove of the E2. This orientates the planar thioester 
bond such that the ubiquitin G76 thioester carbonyl is in the optimal 
arrangement for nucleophilic attack by the incoming substrate lysine. 
This arrangement of the E2 active site was not observed in a UbcH5B- 
Ub oxyester alone’* or when a UbcH5B-Ub oxyester is bound to a 
HECT E3 ligase’’ (Fig. 5c, d). The nucleophilic attack by the substrate
lysine would result in formation of a tetrahedral intermediate on the 
G76 carbonyl carbon. The G76 carbonyl oxygen, with its developing 
Figure 2 | Molecular interfaces in the RNF4 RING-UbcH5A-Ub complex.
a, Detail of the interaction between ubiquitin (orange) and the α2 helix of
UbcH5A (green). b, Detail of the interaction interface between ubiquitin
(orange) and UbcH5A (green) in the E2 active site groove. The side chain of
K85 in UbcH5A that forms the isopeptide bond with ubiquitin is coloured


negative charge, would move down below the plane of the original 
thioester bond and form a hydrogen bond to N77, stabilizing the 
tetrahedral intermediate. In fact the atoms would move towards the 
experimental orientation of the carbonyl in the isopeptide bond that 
makes a 2.8 Å hydrogen bond with N77. The role of UbcH5A D117,
which sits above the thioester and is re-positioned by ubiquitin bind- 
ing, has been clarified by analysis of the D117A mutant. Of the 
mutants which are defective in the ubiquitination assay, only 
D117A retains wild-type levels of oxyester hydrolysis (Fig. 3f). 
Because the E2-Ub oxyester bond is hydrolysed in the presence of 
E3 (no transfer to substrate)’, only a residue with the sole function to 
position and/or activate the incoming lysine nucleophile should pos- 
sess activity in oxyester hydrolysis assays but be inactive in ubiquiti- 
nation. 


Implications for transfer of ubiquitin and related 
modifiers 

This is the first structure of a RING E3 ligase bound to a ubiquitin- 
loaded E2, but the mechanism proposed here for ubiquitin transfer to 
substrate is consistent with previous work. Key roles for residues N77 
(ref. 20) and D117 (ref. 21) in E2 catalytic activity have been suggested 
previously. Evidence that activation of the thioester bond requires 
both ubiquitin/ubiquitin-like modifier (Ubl) and E2 to be bound by 
the E3 comes from previous work on RNF4 (ref. 3), the SIZ1 (ref. 22) 
and RanBP2 (ref. 23) SUMO E3 ligases, and the NEDD4L HECT E3 
ligase’’. The folded-back conformation where the I44 hydrophobic
violet. c, The hydrophobic cluster at the centre of the ubiquitin (orange),
UbcH5A (green) and RING (cyan) heterotrimer. d, Stacking interaction
between the main chain of ubiquitin (orange) in one heterotrimer and Y193 of 
the RING (blue) from the other heterotrimer. 


patch of ubiquitin (or equivalent region of SUMO) engages the α2
helix of the E2 has been suggested as an intermediate in ubiquitin/Ubl 
transfer based on NMR models’*”’, mutagenesis coupled with mod- 
elling’’**, and from the structure of a SUMO substrate-E2-E3 prod- 
uct complex”***. Comparing the NMR model of UBC1 (also called 
UBE2K)-Ub thioester’? with the present structure shows that 
although ubiquitin in the UBC1-Ub thioester is in the folded-back 
conformation, it is different from the present structure where inter- 
actions between ubiquitin and the RING extend and exert tension on 
the ubiquitin C terminus, locking it down into the E2 active site 
groove. In the absence of its cognate E3 the ubiquitin C-terminal tail 
in the UBC1-Ub complex is not locked down in the UBC1 active site 
groove and the thioester is thus not activated (Supplementary Fig. 16). 

The folded-back conformation was also observed in the structure of 
SUMO-modified RanGAP1 in complex with UBC9 (also called 
UBE2I) and the SUMO E3 ligase RanBP2 (ref. 25) (trapped product 
complex). The position of the SUMO C-terminal tail and hydrogen 
bonding interactions within the active site groove of UBC9 are 
remarkably similar to those seen for UbcH5A-Ub bound to the 
RNF4 RING (Supplementary Fig. 17). Although both RNF4 and 
RanBP2 interact with ubiquitin/SUMO to lock it into this conforma- 
tion, molecular details of these contacts are rather different. Whereas 
the RING domain interacts with a hydrophobic patch in ubiquitin 
containing L8, I36 and L71, RanBP2 holds SUMO using a SUMO
interaction motif. Superimposing UBC9 from the RanGAP1- 
SUMO-UBC9-RanBP2 complex with UbcH5A from the RNF4 
Figure 3 | Mutational analysis of the RNF4 RING-UbcH5A-Ub complex.
a, Side chains of altered residues in ubiquitin contacting RNF4 (blue), UbcH5A
(green), or both RNF4 and UbcH5A (yellow). b, Side chains of altered residues
in UbcH5A contacting RNF4 (blue), ubiquitin (orange), both RNF4 and
ubiquitin (yellow), or neither (green). c, Reaction rates were determined
(mean ± s.d. of duplicates) for single-turnover, RNF4-dependent substrate


RING-UbcH5A-Ub structure allows a model of the catalytic transfer 
complex to be constructed (E2-Ub thioester, E3 and substrate) 
(Fig. 5e and Supplementary Fig. 17d, e). This model both unifies 
and provides clear molecular rationale for a body of existing data 
on ubiquitination. 

Although RNF4 is a structurally simple E3 ligase it seems likely that 
similar principles of E2-Ub activation will be used by structurally 
more complex ubiquitin ligases such as the cullin-based ligases”® 
and the anaphase promoting complex/cyclosome” that are also 
Figure 4 | The same interfaces in E2 and ubiquitin are important for CHIP
and RNF4 activity. a, Autoubiquitination activity of CHIP (top panel) and 
RING dependent. Our data suggest E3 ligases for other Ubls are also
likely to use a similar catalytic mechanism**”. The unifying concept is 
that the E3 activates E2-Ub/Ubl thioester by holding the Ub/Ubl in
the folded-back position, extending its C-terminal tail. This is akin to 
tensioning a spring that would be released by cleavage of the thioester 
and formation of the isopeptide bond. Although details of the molecular
contacts that fold back the Ub/Ubl will vary, it is the position of the
C-terminal tail of the Ub/Ubl in the active site groove of the E2 that is 
central to the process. 
anti-ubiquitin antibody are shown. Longer exposure is shown for I36A and
L71A ubiquitin, as binding of the antibody is affected by these mutations. 
b, Autoubiquitination activity as in a, but with UbcH5A mutants. 
Figure 5 | E3-mediated structural changes associated with the catalytically
primed form of UbcH5A-Ub. a, Model of UbcH5A-Ub thioester (grey) 
compared with isopeptide-linked UbcH5A(C85K)-Ub (K85 is violet). 

b, Comparison of modelled thioester with isopeptide linkage. Hydrogen bonds 
are black (isopeptide) or grey (modelled thioester) dashes, with distances 
shown in Å. c, Comparison of the position of ubiquitin relative to E2 in the
UbcH5A-Ub-RING complex reported here with the UbcH5B-Ub- 


METHODS SUMMARY 


Recombinant proteins were expressed in Escherichia coli cells and purified by 
standard methods. For structural analysis of a stable mimic of the UbcH5A-Ub 
thioester, mutations C85K and S22R** were introduced into UbcH5A
(UbcH5A(S22R/C85K)). The isopeptide bond-linked UbcH5A(S22R/C85K)-
Ub conjugate was prepared by incubating UbcH5A(S22R/C85K) (200 µM) with
His₆-tagged ubiquitin (200 µM) and E1 (1 µM) at 35 °C for 26 h in a buffer
containing 3 mM ATP, 5 mM MgCl₂, 50 mM Tris pH 10.0, 150 mM NaCl and
0.8 mM TCEP. The E2-Ub conjugate was purified by Ni²⁺-affinity chromatography.
The His₆-tag was removed using TEV protease and the conjugate was further
purified by Ni²⁺-affinity chromatography and gel filtration chromatography.
The RNF4 RING-UbcH5A(S22R/C85K)-Ub complex was prepared by mixing 
the UbcH5A(S22R/C85K)-Ub with a linear fusion of two RNF4 RING domains 
in a 2:1 molar ratio. Crystals grew from a 1:1 sitting-drop with a reservoir solution
containing 18% (w/v) PEG 3,000, 0.1 M Tris (pH 7.2), and 0.2 M calcium acetate.
The structure was solved by molecular replacement to a resolution of 2.2 Å using
in-house X-rays. A single-turnover substrate ubiquitination assay for RNF4 has
been described previously’. 


Full Methods and any associated references are available in the online version of 
the paper. 
METHODS


Cloning, expression and purification of recombinant proteins. Expression and 
purification of Rattus norvegicus RNF4, human UbcH5A, and His₆-tagged linear
fusion of four SUMO2 molecules (4×SUMO-2) have been described previously’.
Mutations S22R** and C85K were introduced into UbcH5A using PCR-based
site-directed mutagenesis and the mutant protein was expressed and purified as 
described for wild-type UbcH5A. A linear fusion of two RNF4 RING domains 
was generated by sub-cloning the first RING domain (RING1, residues 131-194
of R. norvegicus RNF4) into pLou3 vector* using NcoI and BamHI restriction
sites. The second RING domain (RING2, residues 131-194) was inserted using
BamHI and HindIII restriction sites with a single glycine residue as a linker
between the two RINGs. The RING1-RING2 linear fusion was expressed and
purified as described for wild-type RNF4°. Human ubiquitin (residues 1-76) was 
sub-cloned into pHIS-TEV-30a vector” and expressed in BL21(DE3) E. coli cells 
at 37 °C for 4 h after induction with 1 mM IPTG. His₆-tagged ubiquitin was
purified by Ni-NTA (Qiagen) affinity chromatography and dialysed overnight
into 20 mM Tris, 150 mM NaCl, pH 8.0. To cleave off the His₆-tag, ubiquitin was
incubated with TEV protease, followed by Ni-NTA affinity chromatography to
remove any uncleaved His₆-tagged ubiquitin, the free His₆-tag and the TEV
protease (also His₆-tagged). Purified untagged ubiquitin was then dialysed
against 50 mM Tris, pH 7.5. As a result of cloning, the ubiquitin construct contains
four extra residues at the N terminus (Gly-Ala-Met-Gly) after cleavage with
TEV protease. 

Preparation of UbcH5A-Ub connected with an isopeptide bond. To generate 
the UbcH5A(S22R/C85K)-Ub conjugate, UbcH5A(S22R/C85K) (200 µM) was
incubated with His₆-tagged ubiquitin (200 µM) and His₆-UBE1 (1 µM) at 35 °C
for 26 h in a buffer containing 3 mM ATP, 5 mM MgCl₂, 50 mM Tris pH 10.0,
150 mM NaCl, and 0.8 mM TCEP. Subsequently, imidazole was added to a final 
concentration of 20 mM and the sample was applied onto a Ni-NTA column pre- 
equilibrated with binding buffer (50 mM Tris, 150 mM NaCl, 20 mM imidazole, 
0.5 mM TCEP, pH 7.5). The column was washed with binding buffer and the E2- 
Ub conjugate was eluted with elution buffer (50 mM Tris, 150 mM NaCl, 150 mM 
imidazole, 0.5 mM TCEP, pH 7.5). Elution fractions containing the E2-Ub con- 
jugate were pooled and TEV protease was added to the sample to cleave off the 
His₆-tag from ubiquitin, followed by overnight dialysis at 4 °C against 50 mM
Tris, 150 mM NaCl, 0.5 mM TCEP, pH 7.5. Subsequently, the sample was passed 
through a Ni-NTA column pre-equilibrated in binding buffer to remove any 
uncleaved E2-Ub conjugate and the TEV protease (also His₆-tagged). A
flow-through fraction was concentrated and applied onto a HiLoad 16/60 
Superdex 75 gel filtration column (GE Healthcare) pre-equilibrated in 20 mM 
Tris, 150 mM NaCl, 1mM TCEP, pH 7.0. The purified UbcH5A(S22R/C85K)- 
Ub conjugate was concentrated to 5 mg ml ', flash-frozen in liquid nitrogen and 
stored at —80°C. 

Crystallization of the RNF4 RING-UbcH5A(S22R/C85K)-Ub complex. The
UbcH5A(S22R/C85K)-Ub conjugate was mixed with the linear fusion of two
RNF4 RING domains in a 2:1 molar ratio and the complex was concentrated
to 17 mg ml⁻¹. Proteins were buffer-exchanged into 20 mM Tris, 150 mM NaCl,
1 mM TCEP, pH 7.0 during the concentration step. Crystals were grown at 20 °C
using the sitting-drop vapour diffusion method by mixing 1 µl of protein complex
with 1 µl of reservoir solution (18% (w/v) PEG 3,000, 0.1 M Tris pH 7.2, 0.2 M
calcium acetate). Crystals appeared after 1 or 2 days and grew to their final size
within ~5-7 days. Crystals were briefly soaked in a cryoprotectant solution (10%
(v/v) ethylene glycol, 18% (w/v) PEG 3,000, 0.1 M Tris pH 7.2, 0.2 M calcium
acetate) before flash-freezing in liquid nitrogen.

Data collection and structure determination. Diffraction data were recorded on
a Rigaku Saturn CCD with X-rays generated from a Rigaku 007 HF generator.
Resolution of the crystals was limited by our ability to resolve the long cell
edge, owing to high mosaic spread (approximately 1°) and the orientation of the
crystal. The structure was solved by molecular replacement using PHASER as
implemented in the CCP4 package. A lower resolution (3 Å) data set for the
heterotrimer was solved by finding a single RNF4 RING domain (Protein Data Bank
accession 2XEU), followed by E2 UbcH5A (2YHO) and ubiquitin (1UBQ, truncated at
residue R72). Interestingly, searching for a second copy of each domain alone
did not produce a clear solution. Instead, searches using the RING dimer,
followed by E2, ubiquitin and then the E2-ubiquitin conjugate, or RING monomer,
then E2, then ubiquitin, followed by the RING-E2-ubiquitin heterotrimer, gave
solutions. When a higher resolution data set (2.2 Å) was obtained, the
heterotrimer from the low resolution structure was used to solve these data. The
models were adjusted manually using COOT; the isopeptide bond and the missing
ubiquitin residues were clearly visible and were built into the model. The model
was refined using REFMAC5, and NCS restraints were used throughout. MolProbity
was used to correct side-chain conformations and as a guide to manual building.
The final model has good geometry, with a MolProbity score of 1.42 (99th
percentile); 98.6% of residues are in the favoured regions of the Ramachandran
plot and no residues are in the disallowed regions. Molecular interfaces were
analysed using the PISA server. The two RING molecules in the crystal are fused
together into a single protomer, but comparison with the native (unfused)
dimeric RING domain structure shows that the arrangement of the domains relative
to each other and the contacts between them are very similar. For clarity we
therefore discuss the dimeric RING domain structure in this crystal as if it
were formed by two proteins.

UbcH5A-Ub thioester model. The UbcH5A-Ub thioester model was generated
from the crystal structure by replacing K85 in UbcH5A with a cysteine using
COOT. The N-Cα-Cβ-Sγ dihedral angle was set to 180° (the same conformer as
in 3PTF). The geometry of the model was then minimized by REFMAC for 10
cycles, adding hydrogens at expected positions. Restraints for the thioester
linkage were generated using JLigand.

Ubiquitination assays. A single-turnover substrate ubiquitination assay for
RNF4 has been described previously. Briefly, UbcH5A-Ub thioester was first
prepared in the absence of RNF4 and a substrate. The charging reaction contained
100 µM UbcH5A, 120 µM ubiquitin, 0.2 µM His-UBE1 (E1), 3 mM ATP, 5 mM
MgCl2, 50 mM Tris, 150 mM NaCl, 0.5 mM TCEP, pH 7.5. Apyrase (4.5 U ml⁻¹,
New England BioLabs) was then added to deplete ATP and thus to stop the
charging reaction. The UbcH5A-Ub thioester (~20 µM) was then mixed with
RNF4 (0.275 µM) and a substrate (5.5 µM) buffered with 50 mM Tris, 150 mM
NaCl, 0.5 mM TCEP, 0.1% (v/v) NP40, pH 7.5. A linear fusion of four SUMO2s
(4×SUMO2), labelled with iodine-125, was used as a substrate for RNF4.
Reactions were incubated at room temperature, stopped by SDS-PAGE loading
buffer and analysed by SDS-PAGE, followed by phosphorimaging. Reactions
were performed in duplicate and reaction rates are shown as mean ± s.d. In assays
comparing mutant forms of ubiquitin, untagged UbcH5A and untagged ubiquitin
(the construct described above) were used. His6-tagged UbcH5A and
untagged ubiquitin (obtained from Sigma) were used in assays comparing
UbcH5A mutants.

Single-turnover autoubiquitination assays contained ~20 µM UbcH5A-Ub
thioester and either 0.55 µM RNF4 or 1.1 µM CHIP buffered with 50 mM
Tris, 150 mM NaCl, 0.5 mM TCEP, 0.1% (v/v) NP40, pH 7.5. Reactions were
incubated at room temperature, stopped by SDS-PAGE loading buffer and
analysed by western blotting with anti-ubiquitin antibody (Dako).

UbcH5A(C85S)-Ub oxyester hydrolysis assay. UbcH5A(C85S)-Ub oxyesters
were prepared by incubating UbcH5A(C85S) (100 µM) with ubiquitin (120 µM)
and His-UBE1 (1 µM) in buffer containing 3 mM ATP, 5 mM MgCl2, 50 mM Tris,
150 mM NaCl, 0.5 mM TCEP, pH 7.5 for ~14 h at 37 °C. Apyrase (4.5 U ml⁻¹) was
then added to deplete ATP. UbcH5A(C85S)-Ub oxyesters were mixed with RNF4
(8.8 µM), followed by incubation at room temperature. Reactions were stopped by
SDS-PAGE loading buffer and analysed by SDS-PAGE. Gels were stained with
Coomassie blue, scanned using the Odyssey CLx Infrared Imaging System
(LI-COR Biosciences) and quantified using the LI-COR software. Reactions were
performed in duplicate and reaction rates are shown as mean ± s.d.

Pull-down assay. Binding between MBP-tagged RNF4 and ubiquitin-loaded
UbcH5A was analysed by a pull-down assay as described previously.

Mass spectrometry. UbcH5A(S22R/C85K) and the UbcH5A(S22R/C85K)-Ub
conjugate (both 5 µg) were fractionated by 10% SDS-PAGE. Coomassie-stained
bands were excised and tryptic peptides extracted as described previously,
replacing iodoacetamide with chloroacetamide to limit false identifications of
ubiquitination sites. Peptide samples were analysed by LC-MS/MS using a Q
Exactive mass spectrometer (Thermo Scientific) with high-resolution HCD
fragmentation. Peptides were identified and quantified by MaxQuant (v 1.2.2.5)
running the Andromeda search engine using both a human proteome
(Human IPI v3.68) and the recombinant protein sequence databases. Both
Gly-Gly and Leu-Arg-Gly-Gly variable modifications to lysine were included in
the search to detect ubiquitination by two methods.

Observation of interstellar lithium in the
low-metallicity Small Magellanic Cloud

J. Christopher Howk, Nicolas Lehner, Brian D. Fields & Grant J. Mathews


The primordial abundances of light elements produced in the
standard theory of Big Bang nucleosynthesis (BBN) depend only
on the cosmic ratio of baryons to photons, a quantity inferred from
observations of the microwave background. The predicted
primordial ⁷Li abundance is four times that measured in the
atmospheres of Galactic halo stars. This discrepancy could be caused
by modification of surface lithium abundances during the stars’
lifetimes or by physics beyond the Standard Model that affects
early nucleosynthesis. The lithium abundance of low-metallicity
gas provides an alternative constraint on the primordial abundance
and cosmic evolution of lithium that is not susceptible to the
in situ modifications that may affect stellar atmospheres. Here we
report observations of interstellar ⁷Li in the low-metallicity gas of
the Small Magellanic Cloud, a nearby galaxy with a quarter the
Sun’s metallicity. The present-day ⁷Li abundance of the Small
Magellanic Cloud is nearly equal to the BBN predictions, severely
constraining the amount of possible subsequent enrichment of the
gas by stellar and cosmic-ray nucleosynthesis. Our measurements
can be reconciled with standard BBN with an extremely fine-tuned
depletion of stellar Li with metallicity. They are also consistent with
non-standard BBN.

We obtained high-resolution spectra (R ~ 70,000) of the star Sk 143
(AzV 456), an O-type supergiant star in the Small Magellanic Cloud
(SMC), using the Ultraviolet and Visual Echelle Spectrograph
(UVES) on the 8.2-m Very Large Telescope (VLT); observational
details are given in the Supplementary Information. The sight line to
this star was chosen for observation because it shows significant
absorption from neutral atoms and molecules and a weak
interstellar radiation field, all of which favour the presence of neutral
lithium (Li I). Li I absorption is clearly detected along this sight line
(Fig. 1).

The derivation of the total Li/H abundance in the interstellar medium
(ISM) requires large corrections for ionization, given the column density
of Li, N(Li) ≈ N(Li II) ≫ N(Li I), and for the incorporation of Li into
interstellar dust grains. Our first approach to these corrections uses
observations of adjacent ionization states of other metals, in this case Ca
and Fe, to estimate the amount of unseen gas-phase lithium. Assuming
ionization balance and only atomic processes, we have the ratios
N(Li II)/N(Li I) ∝ N(Ca II)/N(Ca I) or N(Li II)/N(Li I) ∝ N(Fe II)/N(Fe I), where
the constants of proportionality involve the ratios of ionization rates and
recombination coefficients for the elements in question. The ratio of
⁷Li I to total hydrogen in the SMC is log[N(⁷Li I)/N(H)] = −11.17 ± 0.04
(all uncertainties are 1σ unless noted), where N(H) = N(H I) + 2N(H₂).
Applying ionization corrections derived from Ca and Fe yields
logarithmic abundances A(⁷Li) = log[N(⁷Li)/N(H)] + 12 = 2.79 ± 0.11
(from Ca) and 3.01 ± 0.12 (from Fe). These calculations do not include
more complicated (and uncertain) effects such as grain-assisted
recombination, nor do they correct for dust depletion.
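
The size of these ionization corrections can be read back from the numbers quoted above. The sketch below is only arithmetic on those values: it assumes the correction enters as a single additive term in dex, log[N(Li)/N(Li I)], which is an inference from the definition of A(Li) and not a description of the authors' actual procedure. The function name is illustrative.

```python
# Back out the ionization correction implied by the quoted abundances.
LOG_LI1_OVER_H = -11.17          # log[N(7Li I)/N(H)] quoted in the text
A_LI = {"Ca": 2.79, "Fe": 3.01}  # A(7Li) after each ionization correction


def implied_correction_dex(a_li, log_li1_over_h=LOG_LI1_OVER_H):
    """Return log[N(Li)/N(Li I)] implied by A(Li) = log[N(Li)/N(H)] + 12."""
    a_neutral = log_li1_over_h + 12.0  # abundance if all Li were Li I
    return a_li - a_neutral


for element, a_li in A_LI.items():
    dex = implied_correction_dex(a_li)
    print(f"{element}: {dex:.2f} dex, i.e. N(Li)/N(Li I) ~ {10 ** dex:.0f}")
```

Under that assumption the implied corrections are about 1.96 dex (Ca) and 2.18 dex (Fe), which is why the text calls them large: only of order one lithium atom in a hundred is seen as Li I.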

Our second approach uses the observation that N(⁷Li I)/N(K I)
along sight lines through the Milky Way is nearly constant (with
new determinations giving consistent results). When a differential
ionization correction is applied, ⁷Li/K in the Milky Way ISM is
consistent with the Solar System ratio. Thus, Li and K appear to have
very similar ionization and dust depletion behaviours, and ⁷Li I/K I
gives a good measure of the total (gas + dust phase) ⁷Li/K (refs 16, 19
and 20). We measure log[N(⁷Li I)/N(K I)] = −2.27 ± 0.03 in the
SMC, in agreement with the Galactic relationship. Applying an
ionization correction of +0.54 ± 0.08 dex (refs 16 and 17) gives
log[N(⁷Li)/N(K)] = −1.78 ± 0.09. With the Solar System ratio
log(⁷Li/K)_⊙ = −1.82 ± 0.05 derived from meteorites, we find
[⁷Li/K]_SMC = log[N(⁷Li)/N(K)] − log(⁷Li/K)_⊙ = +0.04 ± 0.10. The ratio of
⁷Li to metal nuclei in the SMC is consistent with that found in the Solar
System and the Milky Way ISM: (⁷Li/K)_SMC ≈ (⁷Li/K)_⊙.

Although the ionization and depletion characteristics of S I are not
as well tied to those of Li I (ref. 17), a similar approach using S I yields
[⁷Li/S]_SMC = −0.26 ± 0.11. The sub-solar ratio is consistent with a
modest (0.3 dex) depletion of Li and K onto dust in the ISM relative
to S.

We estimate A(⁷Li) by scaling ⁷Li/K to ⁷Li/H:
A(⁷Li)_SMC = A(⁷Li)_⊙ + [Fe/H]_SMC + [K/Fe]_SMC + [⁷Li/K]_SMC.
We adopt [⁷Li/K]_SMC from above, the meteoritic A(⁷Li)_⊙ = 3.23 ± 0.05
(ref. 21), with a mean present-day SMC metallicity [Fe/H]_SMC = −0.59 ± 0.06
and an SMC K/Fe abundance [K/Fe]_SMC = +0.00 ± 0.10 (these last two
are discussed in the Supplementary Information). This yields
A(⁷Li)_SMC = 2.68 ± 0.16. Similarly scaling the ⁷Li/S result gives
2.38 ± 0.17.
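
The scaling above is a sum of logarithmic terms, so it can be checked directly. The following sketch assumes the quoted 1σ uncertainties are independent and combine in quadrature; the text does not state the error model, so that is an assumption, although it does reproduce the quoted values. The helper names are illustrative.

```python
import math


def quadrature(*errors):
    """Combine independent 1-sigma errors in quadrature."""
    return math.sqrt(sum(e * e for e in errors))


def add_terms(*terms):
    """Sum (value, error) pairs given in dex."""
    value = sum(v for v, _ in terms)
    return value, quadrature(*(e for _, e in terms))


# Values quoted in the text (all in dex, 1-sigma errors).
log_li_k_smc = (-1.78, 0.09)   # log[N(7Li)/N(K)] in the SMC
log_li_k_sun = (-1.82, 0.05)   # meteoritic Solar System ratio
a_li_sun = (3.23, 0.05)        # meteoritic A(7Li)
fe_h_smc = (-0.59, 0.06)       # present-day SMC metallicity
k_fe_smc = (0.00, 0.10)        # SMC [K/Fe]

# [7Li/K]_SMC is a difference of two logs; the errors still add in quadrature.
li_k_smc = (log_li_k_smc[0] - log_li_k_sun[0],
            quadrature(log_li_k_smc[1], log_li_k_sun[1]))

a_li_smc = add_terms(a_li_sun, fe_h_smc, k_fe_smc, li_k_smc)

print(f"[7Li/K]_SMC = {li_k_smc[0]:+.2f} +/- {li_k_smc[1]:.2f}")
print(f"A(7Li)_SMC  = {a_li_smc[0]:.2f} +/- {a_li_smc[1]:.2f}")
```

With these inputs the sketch gives [⁷Li/K]_SMC = +0.04 ± 0.10 and A(⁷Li)_SMC = 2.68 ± 0.16, matching the values in the text.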

Most previous observational constraints on the primordial Li
abundance have relied on measurements of atmospheric abundances
in low-metallicity Galactic stars. Our detection of interstellar lithium
beyond the Milky Way opens a new window on the lithium problem.
Although there are significant uncertainties associated with ionization
and dust effects, as demonstrated by the significant spread in A(⁷Li)_SMC
values, these are largely independent of the uncertainties that might
affect stellar measurements of the primordial lithium abundance. Our
recommended absolute abundance is A(⁷Li)_SMC = 2.68 ± 0.16, or
(⁷Li/H)_SMC = (4.8 ± 1.8) × 10⁻¹⁰, derived from ⁷Li/K. This is
compared to stellar ⁷Li abundances at different metallicities in Fig. 2.
Our best estimate overlaps the prediction from standard BBN using the
baryonic density deduced from the five-year Wilkinson Microwave
Anisotropy Probe (WMAP) data, A(⁷Li) = 2.72 ± 0.06 (95% confidence
level; ref. 3), although this leaves little room for the post-BBN
chemical evolution, that is, the contribution of freshly synthesized
Li to the ISM by stellar and cosmic ray nucleosynthesis (see
representative models in Fig. 2). Our estimate of A(⁷Li)_SMC is also consistent
with the upper envelope of Li abundances in Milky Way thin-disk stars
(Fig. 2).
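
The conversion between the logarithmic abundance and the number ratio quoted above follows from the definition A(Li) = log[N(Li)/N(H)] + 12. A minimal sketch, assuming first-order (linearized) propagation of the 1σ error, which is an assumption rather than the authors' stated method:

```python
import math


def abundance_to_ratio(a, sigma_a):
    """Convert A(X) = log[N(X)/N(H)] + 12 to N(X)/N(H) with a first-order error."""
    ratio = 10.0 ** (a - 12.0)
    sigma_ratio = math.log(10.0) * ratio * sigma_a
    return ratio, sigma_ratio


ratio, sigma = abundance_to_ratio(2.68, 0.16)
print(f"(7Li/H)_SMC = ({ratio / 1e-10:.1f} +/- {sigma / 1e-10:.1f}) x 10^-10")
```

This returns (4.8 ± 1.8) × 10⁻¹⁰ for A(⁷Li)_SMC = 2.68 ± 0.16, in line with the quoted figure.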

However, given the uncertainties in scaling to A(⁷Li)_SMC, the stronger
result is our measurement of [⁷Li/K]_SMC = +0.04 ± 0.10. We compare
[⁷Li/K]_SMC with measurements of [⁷Li/Fe] and chemical evolution
models in Fig. 3. The stars show a rapid decrease in [⁷Li/Fe] with
increasing metallicity until [Fe/H] ~ −1, at which point the Li abundance
increases roughly in lockstep with Fe such that disk stars have a

Figure 1 | Interstellar absorption by several neutral species seen towards the
star Sk 143. Normalized interstellar absorption profiles from UVES plotted
versus the Local Standard of Rest velocity, v_LSR, and profile fit of the Li I
absorption. The empirically determined signal-to-noise ratio is about 275 per
pixel (5 pixels per resolution element) for the Li I observations. The full set of
optical and ultraviolet absorption profiles seen towards this star and the column
densities measured from these are given in the Supplementary Information.
b, The profiles of Li I, K I and Fe I; the SMC cloud bearing Li I at
v_LSR ≈ +121 km s⁻¹ is marked with the dashed line. The thicker grey regions
near Li I are possibly contaminated by diffuse interstellar bands or residual
fringing, which may extend into the region containing Li I absorption. The
effects on the ⁷Li I columns are within the quoted uncertainties. The Li I
absorption is composed of (hyper)fine structure components of both ⁷Li I and
⁶Li I (shown, respectively, by the green and blue ticks in the top panel of a). The
strong line of ⁷Li I is detected with approximately 16σ significance in the ISM of
the SMC. A model fit to the Li I absorption complex is shown in a (see
Supplementary Information), with the difference between the data and the fit,
δ, shown immediately below (normalized to the local error array). The free
parameters for the fit are the polynomial coefficients for the stellar continuum,
the central velocity, Doppler parameter (b-value), and column densities of ⁷Li I
and ⁶Li I for the interstellar cloud. The red curve shows the best-fitting model
including both ⁷Li I and ⁶Li I, which are shown in green and blue, respectively.
The best-fit isotopic ratio is N(⁶Li I)/N(⁷Li I) = 0.13 ± 0.05 (68% confidence
limit), consistent with the presence of ⁶Li along the sight line, although below
the 3σ detection threshold.


nearly constant [Li/Fe] ratio, similar to that found in the Solar System. 
Our measurement of the present-day ’Li-to-metal ratio in the SMC is in 
agreement with the nearly constant values found in the atmospheres of 
Milky Way disk stars (−1 < [Fe/H] < 0), most of which formed over

Figure 2 | Estimates of the lithium abundance in the SMC interstellar
medium and in other environments. Our best estimate for the interstellar
(gas + dust phase) abundance of A(⁷Li) in the SMC (red circle) is derived from the
⁷Li I/K I ratio. The present-day metallicity of the SMC from early-type stars is
[Fe/H] = −0.59 ± 0.06. (All uncertainties are 1σ.) The point marked BBN and the
dotted horizontal line show the primordial abundance predicted by standard
BBN. The green curves show recent models for post-BBN ⁷Li nucleosynthesis
due to cosmic rays and stars. By adjusting the yields from low-mass stars, the
models are forced to match the Solar System meteoritic abundance (see
Supplementary Information). The solid and dashed lines correspond to models A
and B, which include (A) or do not include (B) a presumed contribution to ⁷Li
from core-collapse supernovae. The blue hatched area shows the range of
abundances derived for Population II stars in the Galactic halo, with the ‘Spite
plateau’ in this sample at A(⁷Li)_PopII ≈ 2.10 ± 0.10 (ref. 6). The violet hatched
area shows the range of measurements seen in Galactic thin-disk stars, and the
thicker violet lines denote the six most Li-rich stars in a series of eight metallicity
bins. The selection of thin-disk stars includes objects over a range of masses and
temperatures, including stars that are expected to have destroyed a fair fraction of
their Li. Thus, the upper envelope of the distribution represents the best estimate
of the intrinsic ISM Li abundance at the epoch of formation for those stars, and
the thicker hatched area for the thin-disk sample is most appropriate for
comparison with the SMC value. The most Li-rich stars in the Milky Way thin
disk within 0.1 dex of the SMC metallicity give A(⁷Li)_MilkyWay = 2.54 ± 0.05,
consistent with our estimate of A(⁷Li)_SMC = 2.68 ± 0.16.


4 billion years ago, with the Solar System and the modern-day Milky 
Way ISM.

Both the thin-disk stars and our SMC measurements are below
standard BBN predictions with reasonable assumptions about post-BBN
production, although it is often assumed these stars have had
significant depletion of their surface Li abundance. Taken at face
value, the consistency of our SMC measurement with the [⁷Li/Fe]
for those stars calls this assumption into question. Although the
models in Figs 2 and 3 are imprecise given the uncertain Li yields from
stellar sources, they illustrate the tension between standard BBN
predictions and our measurements if there is any post-BBN Li production.
This tension can be relieved if a metallicity-dependent depletion of Li
in stellar atmospheres is fine-tuned in such a way that it is very strong
below [Fe/H] ~ [Fe/H]_SMC = −0.6 (to create the Spite plateau and
avoid overproducing Li in the SMC ISM) and negligible at or above
the SMC metallicity, thus conspiring to create a constant [⁷Li/Fe] ratio
above [Fe/H] ~ −1. Alternatively, non-standard BBN scenarios can
be invoked to allow for a lower primordial Li abundance.

If non-standard Li production occurs in the BBN epoch, many such
models predict excess ⁶Li compared with the standard BBN. The only
known source of post-Big Bang ⁶Li is production via cosmic ray
interactions with ISM particles. Excess ⁶Li at the metallicity of the SMC
would support non-standard production mechanisms, either in the
BBN epoch or through the interaction of pregalactic cosmic rays with
intergalactic helium. Measurements of ⁶Li in stellar atmospheres are
extremely difficult because the stellar line broadening is well in excess

Figure 3 | Estimates of ⁷Li/Fe in the SMC interstellar medium and in several
different environments. The SMC value is derived from the ⁷Li I/K I ratio. At
low metallicities ([Fe/H] < −1), stellar measurements trace the build-up of Fe
with a constant Li abundance along the Spite plateau. At higher metallicities,
disk star abundances show a turnover to roughly constant [⁷Li/Fe] at values
consistent with the Solar System meteoritic value (shown as the dash-dotted
black line at [⁷Li/Fe] = 0). Our SMC estimate is consistent with the Solar
System and disk star abundances in this region of relatively constant ⁷Li/Fe
abundances, with [⁷Li/Fe]_SMC = +0.04 ± 0.14 for [K/Fe]_SMC = 0.0 ± 0.10
(Supplementary Information). The most Li-rich disk stars within 0.1 dex of the
SMC metallicity have a mean [⁷Li/Fe] = −0.13 ± 0.05. (All uncertainties are
1σ.) The green curves show the chemical evolution models as in Fig. 2,
whereas the dotted line shows the behaviour of [⁷Li/Fe] for the standard BBN
primordial abundance with no subsequent evolution of ⁷Li. The relative
uniformity of the stellar ⁷Li/Fe abundances at [Fe/H] ≳ −1 could be caused by
a delicate balance of Li and Fe production and metallicity-dependent depletion
of the surface Li abundances (not ruled out given the changes in mean age and
mass potentially present in the sample). However, the agreement of the
[⁷Li/Fe] ratio seen in these old stars (ages exceeding 4 billion years) and in the
present-day interstellar medium of the SMC suggests little change in the stellar
abundances for metallicities [Fe/H] ~ −0.6 up to the solar metallicity. To bring
the stellar and SMC interstellar abundances into agreement with standard BBN
predictions requires a delayed injection of significant ⁷Li from stellar
production mechanisms as well as vigorous depletion of stellar surface ⁷Li
abundances at metallicities just below that of the SMC.


of the isotope shift. However, the ⁷Li I doublet is well separated in our
data owing to the very low broadening in the cool ISM probed by Li I
absorption. Our best fit to the SMC Li I absorption gives
(⁶Li/⁷Li)_SMC = 0.13 ± 0.05 (see Supplementary Information and
Fig. 1), giving a formal limit to the isotopic ratio in the SMC of
(⁶Li/⁷Li)_SMC < 0.28 (3σ). With higher signal-to-noise ratios and
resolution it should be possible to lower the limits for the interstellar
isotope ratio in the SMC to provide constraints on non-standard
BBN models. This approach has the advantage that ionization and
dust-depletion effects are not important for comparing the two
isotopes of Li (ref. 27), making ⁶Li/⁷Li a powerful diagnostic of
nucleosynthesis and non-standard evolution of Li abundances.
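
The formal limit quoted above is consistent with a simple Gaussian reading of the best-fit isotopic ratio. The sketch below assumes the limit is the best-fit value plus three times its 1σ uncertainty; the text does not spell out how the limit was constructed, so this is an assumption that happens to reproduce the number.

```python
def significance(value, sigma):
    """Number of standard deviations by which value differs from zero."""
    return value / sigma


def upper_limit(value, sigma, n_sigma=3.0):
    """One-sided limit taken as best fit plus n_sigma standard deviations."""
    return value + n_sigma * sigma


ratio, sigma = 0.13, 0.05  # best-fit 6Li/7Li and its 1-sigma error
print(f"significance = {significance(ratio, sigma):.1f} sigma")
print(f"3-sigma upper limit = {upper_limit(ratio, sigma):.2f}")
```

The ratio differs from zero by 2.6σ, short of a 3σ detection, and the corresponding upper limit is 0.28.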

No meridional plasma flow in the heliosheath
transition region

Robert B. Decker, Stamatios M. Krimigis, Edmond C. Roelof & Matthew E. Hill


Over a two-year period, Voyager 1 observed a gradual slowing-down
of radial plasma flow in the heliosheath to near-zero velocity
after April 2010 at a distance of 113.5 astronomical units from the
Sun (1 astronomical unit equals 1.5 × 10⁸ kilometres). Voyager 1
was then about 20 astronomical units beyond the shock that
terminates the free expansion of the solar wind and was immersed
in the heated non-thermal plasma region called the heliosheath.
The expectation from contemporary simulations was that the
heliosheath plasma would be deflected from radial flow to
meridional flow (in solar heliospheric coordinates), which at
Voyager 1 would lie mainly on the (locally spherical) surface called
the heliopause. This surface is supposed to separate the heliosheath
plasma, which is of solar origin, from the interstellar plasma, which
is of local Galactic origin. In 2011, the Voyager project began
occasional temporary re-orientations of the spacecraft (totalling
about 10-25 hours every 2 months) to re-align the Low-Energy
Charged Particle instrument on board Voyager 1 so that it could
measure meridional flow. Here we report that, contrary to
expectations, these observations yielded a meridional flow velocity of
+3 ± 11 km s⁻¹, that is, one consistent with zero within statistical
uncertainties.


[Figure 1 labels: unexpected transition region; measured plasma flow, V_R = −6.5 ± 4 km s⁻¹ and V_T = −26 ± 4 km s⁻¹; heliopause; termination shock (~90 AU).]


Figure 1 | Heliospheric plasma boundaries and flow regions relative to the
location of Voyager 1. The solar wind flows initially radially outwards from
the Sun, and in the outer heliosheath its expected meridional deflection
becomes parallel to the heliopause. The interstellar flow is from the left in the
image; it should be deflected around the heliosheath in the region beyond
Voyager 1. Voyager 1 is shown in its own meridional plane (in solar
heliospheric coordinates) within the unexpected transition region that it first
encountered at a helioradius of ~113 AU (ref. 1; unit vectors of the heliospheric
RTN system are indicated). It had been expected that heliosheath plasma flow
near the heliopause would have a near-zero radial component and that its
meridional component V_N would be a significant fraction of 25 km s⁻¹, to be
consistent with the distant speed of the local interstellar plasma and its
deflection around the heliosheath. However, from the data taken during five
rolls of Voyager 1, we have determined that ⟨V_R⟩ = −14 ± 14 km s⁻¹ and
⟨V_N⟩ = +3 ± 11 km s⁻¹. We conclude that the roll data taken at Voyager 1 are
statistically consistent with V_N = 0. Figure adapted from an image online on the
Voyager website at the Jet Propulsion Laboratory
(http://voyager.jpl.nasa.gov/news/new_region.html).


The discovery by Voyager 1 of the zero radial velocity of
heliosheath plasma flow beyond ~113.5 astronomical units (AU) in a
previously unsuspected transition region (Fig. 1) led to the suggestion
that the initially radial flow in the heliosheath was already being
deflected polewards (towards meridional flow), as predicted by typical
magnetohydrodynamic models. The suggested meridional flow
could not be measured by the Low-Energy Charged Particle
instrument in the usual orientation of its scanning plane on board Voyager 1,
so starting in March 2011 the spacecraft was commanded to rotate
about its Earth-pointing axis for about one day every second month to
enable the instrument to measure flow speeds in the meridional or
R-N plane. (In the RTN coordinate system, R is the radius vector from
the Sun, T is in the direction of solar rotation and N completes a
right-handed system.) Measurements from five such rolls performed during
2011/066-2012/030 have been analysed so far (date notation is

Figure 2 | Comparison of radial and azimuthal components of heliosheath
plasma flow velocity at Voyager 2. a, Crosses, daily averaged values of V_R
measured by the plasma instrument (PLS) during 2011/001-2012/035. Circles,
5-d-averaged determinations of V_R using the Fourier fit procedure on
28-43-keV ion angular data from the Low-Energy Charged Particle instrument
(LECP). Vertical error bars are Poisson statistical uncertainties (±1σ) about the
mean. During the period shown, Voyager 2 moved from helioradius 94.2 AU to
helioradius 97.4 AU and from heliolongitude 29.3° S to heliolongitude 29.8° S.
b, Crosses, daily averaged values of V_T measured by the PLS. Diamonds,
5-d-averaged determinations of V_T using the Fourier fit procedure on ion angular
data. Vertical error bars are Poisson statistical uncertainties (2σ) about the
mean. There is generally good agreement between the measured solar wind
components and those determined from fits to the low-energy ion angular
measurements. The two shaded periods show where angular data on ions in
several energy channels of the LECP allow us to identify relatively large
non-convective anisotropies consistent with +T-directed streaming along the
average azimuthal orientation of the magnetic field in the heliosheath.

year/day of year). We report here the absence of a statistically significant
persistent N component of flow, with a cumulative average velocity
over the five roll periods of ⟨V_N⟩ = +3 ± 11 km s⁻¹. Longer-term
averages of the R and T components of flow during 2011/066-2012/030
in the usual instrument orientation yielded a sunward radial
velocity of V_R = −7 ± 4 km s⁻¹ and a persistent negative azimuthal
velocity of V_T = −26 ± 4 km s⁻¹.

We test the null hypothesis for convective flow in the N direction,
that is, that the meridional component (V_N) of any convective plasma
flow within the transition region is statistically consistent with zero.
Thus, we make the simplest assumption of convective flow in our most
sensitive energy channel (ions with energies of 53-85 keV) and
compute the velocity implied by the angular distribution of ion intensities.
If that velocity is consistent with zero within our estimated errors, then
we conclude that there is no measurable V_N.

The Low-Energy Charged Particle telescope samples the anisotropy
of the energetic ion intensity in seven positions spaced by 45° in its
scan plane; one sector was intentionally blocked. The counting rates
C(φ_m) in the seven usable sector positions (m = 1, 2, ..., 7)
overdetermine the first five coefficients in the Fourier expansion


C(φ) = C_0 [1 + A_1 cos(φ) + B_1 sin(φ) + A_2 cos(2φ) + B_2 sin(2φ)]   (1)

Figure 3 | Roll measurements and derived components of heliosheath flow
velocity. a, Top, orientation of heliospheric (RTN) and instrument-associated
(RT′N′) axes at the location of Voyager 1 in the usual spacecraft configuration. The
LECP scans in the R-T′ plane (shaded grey). The N′-T′ plane is rotated about the
R axis by 20° relative to the N-T plane. Bottom, view along the +N′ axis showing
sector positions in the R-T′ plane (sector 8 is blocked) and the first-order
anisotropy angle φ_1. b, Top, orientation of axes in rolled spacecraft configuration.
Bottom, view along the +T = +N′ axis showing sector positions in the R-N
plane. c-g, Count rates versus sector in rolled configuration for five roll periods.
Solid symbols are roll-averaged count rates of 53-85-keV ions, predominantly
protons, based on independent composition measurements. Vertical error bars
are Poisson statistical uncertainties (±1σ) about the mean. Horizontal bars
indicate sector angular width. The red curve is a least-squares fit to data of the
function in equation (1). The blue curve is the first-harmonic component of the
fit. h, Intensity of 53-85-keV protons. The blue and black traces are respectively
the intensities in sectors 1 and 5 of particles arriving from the sunward (sector 1)
and anti-sunward (sector 5) directions in the usual spacecraft configuration. The
helioradius of Voyager 1 is given along the top axis. The dashed vertical lines
indicate the roll periods in panels c-g. i, 26-d-averaged plasma flow velocity
components V_R and V_T (roll periods not included). Vertical error bars are
Poisson statistical uncertainties (2σ) about the mean. For comparison, the average
value of the solar wind speed is ~400 km s⁻¹. j, Expanded view of panel
i including the five roll determinations of V_N (diamonds). Vertical error bars are
Poisson statistical uncertainties (2σ) about the mean. The first dot-dash vertical
line shows the onset, at ~2010/133, of a 208-d (2.05-AU) stretch of zero V_R; the
second dot-dash vertical line, at ~2010/341, marks the end of the steady zero-V_R
flow and the transition to variable and often negative-V_R flow. Anisotropies in the
usual configuration after ~2010/341 show non-convective features in sectors 6
and 7, consistent with −T-directed streaming along the average azimuthal
orientation of the magnetic field in the heliosheath.

Table 1 | Voyager 1 roll periods, fit coefficients and heliosheath plasma flow velocities

Roll period                       No. rolls                C_0 (×10⁻¹ counts s⁻¹)  A_1 (×10⁻²)     B_1 (×10⁻²)    ξ_1 (×10⁻²)‡   φ_1‡       γ      V_R (km s⁻¹)     V_N (km s⁻¹)
1: 2011/066-073 (March 2011)      6 (21 h)*, 28,858 s§     1.97 ± 0.05 (890)†      6.54 ± 3.78     0.20 ± 2.94    6.54 ± 3.77    159 ± 26   1.55   −26.0 ± 25.8     −16.3 ± 21.7
2: 2011/121-131 (May 2011)        7 (25 h), 34,474 s       1.65 ± 0.04 (819)       3.81 ± 4.03     4.37 ± 3.13    5.80 ± 3.54    206 ± 36   1.52   −19.8 ± 27.8     +18.4 ± 23.3
3: 2011/207-217 (July 2011)       6 (20 h), 27,821 s       1.58 ± 0.05 (631)       2.53 ± 4.65     4.67 ± 3.62    5.32 ± 3.88    219 ± 48   1.48   −12.7 ± 32.6     +24.2 ± 27.4
4: 2011/302-307 (October 2011)    6 (10 h), 13,306 s       1.32 ± 0.07 (254)       −3.88 ± 7.87    0.95 ± 6.13    4.00 ± 7.78    144 ± 90   1.44   −6.5 ± 56.1      −17.3 ± 47.1
5: 2012/016-030 (January 2012)    7 (10 h), 13,824 s       1.74 ± 0.07 (343)       −1.24 ± 6.18    2.42 ± 4.73    2.73 ± 5.06    94 ± 124   1.40   +15.5 ± 44.0     −20.2 ± 37.0
Time-weighted average                                                                                                                              −14.0 ± 13.6‖    +2.8 ± 11.4

* Number of hours of data taken during rolled configuration that were used in (V_R, V_N) analysis.
† Mean number of counts per sector for the seven active sectors of the Low-Energy Charged Particle instrument.
‡ First-order anisotropy amplitude ξ_1 = (A_1² + B_1²)^(1/2) and associated azimuth angle φ_1 = tan⁻¹(B_1/A_1) (Fig. 3a).
§ Total data accumulation time of data used in flow velocity determination; used to perform weighted average in row 6.
‖ Mean values of V_R and V_T determined from data taken during 2011/066-2012/030 are V_R = −6.5 ± 4.1 km s⁻¹ and V_T = −25.8 ± 3.8 km s⁻¹.

A least-squares solution yields the amplitudes and phases of the first
two harmonics. We assume that ions have an isotropic intensity j ∝
E^(−γ) in a frame moving with the heliosheath flow. The instrument
measures the spectral slope (γ) in adjacent energy (E) channels. The
well-known theory of the Compton-Getting effect relates the components
of the convective flow (V_R, V_N) in the scan plane to the coefficients
of the first harmonic anisotropy through the spectral slope and
ion speed (v):

$$A_1 = 2(\gamma + 1)\,(V_R/v), \qquad B_1 = 2(\gamma + 1)\,(V_N/v) \qquad (2)$$
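
As an illustration of how equation (2) is used, the Python sketch below inverts it to turn first-harmonic coefficients into flow components. It is a minimal sketch and not the authors' analysis code: the function names are hypothetical, and taking the geometric mean of the 53-85-keV channel limits as the representative proton energy is an assumption made here. It applies no corrections beyond equation (2), so it is not expected to reproduce the Table 1 velocities.

```python
import math

PROTON_REST_ENERGY_KEV = 938_272.0  # proton rest energy m_p c^2, in keV
C_KM_S = 299_792.458                # speed of light, in km/s


def proton_speed_km_s(kinetic_energy_kev):
    """Relativistic speed of a proton with the given kinetic energy."""
    lorentz = 1.0 + kinetic_energy_kev / PROTON_REST_ENERGY_KEV
    beta = math.sqrt(1.0 - 1.0 / lorentz**2)
    return beta * C_KM_S


def flow_from_anisotropy(a1, b1, spectral_index, ion_speed):
    """Invert equation (2): V_R = A1 v / (2(gamma+1)), V_N = B1 v / (2(gamma+1))."""
    scale = ion_speed / (2.0 * (spectral_index + 1.0))
    return a1 * scale, b1 * scale


def anisotropy_from_flow(v_r, v_n, spectral_index, ion_speed):
    """Forward form of equation (2), useful as a round-trip check."""
    scale = 2.0 * (spectral_index + 1.0) / ion_speed
    return v_r * scale, v_n * scale


# Assumed representative energy: geometric mean of the channel limits.
speed = proton_speed_km_s(math.sqrt(53.0 * 85.0))
v_r, v_n = flow_from_anisotropy(0.05, -0.02, 1.5, speed)
print(f"v = {speed:.0f} km/s, V_R = {v_r:.1f} km/s, V_N = {v_n:.1f} km/s")
```

With this assumed energy the proton speed is about 3,600 km s⁻¹, so an anisotropy of a few per cent corresponds to a flow of a few tens of km s⁻¹.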


We have calibrated our fitting procedure by comparison with the V_R
and V_T components of plasma flow measured by Voyager 2 in the
heliosheath using the Plasma Science instrument (Fig. 2), which
directly measures the solar wind velocity. The Plasma Science instrument
on Voyager 1 failed in 1980. The comparison in Fig. 2 shows
that, with the exception of periods of weak field-aligned ion streaming
that we can readily identify in both the Voyager 2 and Voyager 1 data,
the velocities derived from directional intensities of low-energy ions
by the method described above reproduce the solar wind
velocity components quite well in the Low-Energy Charged Particle
instrument's scan plane. This justifies a posteriori our assumption
that the particle distribution function is essentially isotropic in the
plasma frame.

Orientations of the scan plane in its usual and rolled configurations
are shown in Fig. 3a, b. The results of our Fourier analysis of the five
Voyager 1 roll periods are presented in Fig. 3c-g. Figure 3h shows the
intensities of 53-85-keV heliosheath protons arriving from the
sunward and anti-sunward directions. The roll period dates, numbers
of rolls per period and fit coefficients (C₀, A₁ and B₁) are given in
columns 1-5 of Table 1. The errors in C₀, A₁ and B₁ are determined
by propagating the Poisson statistical uncertainties in the sectored
counting rates, which are shown as vertical error bars (1σ) about
the mean in Fig. 3c-g, using the equations that express C₀, A₁ and B₁ as
functions of the sectored rates. Alternative representations of A₁ and
B₁ in terms of the first-order anisotropy amplitude ξ₁ and azimuth φ₁
are given in columns 6 and 7 of Table 1, and the spectral power-law
index γ is in column 8. The plasma convection velocity components V_R
and V_N (columns 9 and 10) implied by the first harmonic are given
along with their uncertainties, which are calculated by propagating the
errors in fit coefficients C₀, A₁ and B₁ using equation (2). The
time-weighted averages of V_R and V_N over all five rolls are summarized in
row 6 of those two columns. The five determinations of V_N in the
rolled configuration are plotted in Fig. 3j along with those of V_R and V_T,
the latter two components calculated using 26-d-averaged ion angular
data taken in the usual (unrolled) configuration.
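
The relation between the fit coefficients and the anisotropy amplitude in Table 1 can be checked numerically. The Python sketch below is an illustration only: it recomputes ξ₁ = (A₁² + B₁²)^(1/2) from the tabulated A₁ and B₁ (in units of 10⁻²) and compares it with column 6. The amplitude does not depend on the signs of A₁ and B₁, so any sign lost in reproduction of the table does not affect this check; the azimuth φ₁ is not recomputed here because it depends on sign and quadrant conventions.

```python
import math

# (A1, B1) for roll periods 1-5, in units of 1e-2, as listed in Table 1.
FIT_COEFFICIENTS = [
    (6.54, 0.20),
    (3.81, 4.37),
    (2.53, 4.67),
    (-3.88, 0.95),
    (-1.24, 2.42),
]

# Column 6 of Table 1: first-order anisotropy amplitude, in units of 1e-2.
TABULATED_AMPLITUDES = [6.54, 5.80, 5.32, 4.00, 2.73]


def anisotropy_amplitude(a1, b1):
    """First-order anisotropy amplitude xi_1 = sqrt(A1^2 + B1^2)."""
    return math.hypot(a1, b1)


computed = [anisotropy_amplitude(a1, b1) for a1, b1 in FIT_COEFFICIENTS]
for roll, (value, listed) in enumerate(zip(computed, TABULATED_AMPLITUDES), start=1):
    print(f"roll {roll}: computed {value:.2f}, tabulated {listed:.2f}")
```

The recomputed amplitudes agree with the tabulated ones to within rounding of the two-decimal inputs.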

The averages of the velocity components and their uncertainties
derived from the five roll periods are ⟨V_R⟩ = −14 ± 14 km s⁻¹ and
⟨V_N⟩ = +3 ± 11 km s⁻¹. The negative mean radial velocity during the
rolls is consistent within errors with the more statistically significant
result (−6.5 ± 4.1 km s⁻¹) from the 26-d averages spanning the five
roll periods in Fig. 3i, j (filled circles). The 26-d averages of the
azimuthal flow (V_T) have a larger and statistically significant negative
value (−25.8 ± 3.8 km s⁻¹). Although azimuthal flow is not the topic
of this report, we do not wish its clear signature to pass unnoticed.
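
The time-weighted averages can be checked directly from the per-roll values in Table 1. The Python sketch below is an illustration, not the authors' code. It weights each roll by its data accumulation time and assumes that the velocity entries printed without a sign for rolls 1 and 4 are negative; that assumption reproduces the quoted averages and the stated count of two positive and three negative V_N values. The uncertainties of the averages are not recomputed.

```python
# Per-roll accumulation times (s) and flow components (km/s) from Table 1.
# Assumption: entries for rolls 1 and 4 that carry no sign are negative.
ACCUMULATION_S = [28_858, 34_474, 27_821, 13_306, 13_824]
V_R = [-26.0, -19.8, -12.7, -6.5, 15.5]
V_N = [-16.3, 18.4, 24.2, -17.3, -20.2]


def time_weighted_mean(values, weights):
    """Mean of `values` with each entry weighted by its accumulation time."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


mean_v_r = time_weighted_mean(V_R, ACCUMULATION_S)
mean_v_n = time_weighted_mean(V_N, ACCUMULATION_S)
print(f"<V_R> = {mean_v_r:.1f} km/s, <V_N> = {mean_v_n:+.1f} km/s")
```

Under these assumptions the weighted means come out close to −14.0 and +2.8 km s⁻¹, matching row 6 of Table 1.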

We offer several arguments that the time-weighted average for V_N
over five rolls is statistically consistent with zero, and moreover that it
is small in an absolute sense. First, if our measurements of V_N are
consistent with zero, we would expect that roughly half of our roll
period measurements would have V_N > 0 and half would have
V_N < 0. Over the five spacecraft rolls, two had V_N > 0 and three had
V_N < 0. Second, the Poisson error bars for each roll always bracket
zero, giving no indication of a systematic non-zero flow. Third, as the
Poisson distribution can be approximated by a Gaussian because the
number of counts accumulated in each sector exceeds 250 (Table 1,
column 3), our five-roll result ⟨V_N⟩ = +3 ± 11 km s⁻¹ implies only a
16% probability that ⟨V_N⟩ exceeds +14 km s⁻¹. For comparison, the
distant upstream flow velocity of the local interstellar medium is
~25 km s⁻¹. The solar radial vector to Voyager 1 is ~30° offset from
the upstream flow direction, that is, from the expected 'nose' of the
heliosheath. At this angle, most steady-state models show a positive
meridional flow that within the heliosheath is a significant fraction of
the distant upstream value or even exceeds it (because of the constriction
of plasma streamlines as they divert around the heliosheath). Our
results give us 84% confidence that ⟨V_N⟩ is less than half of the distant
upstream flow, even though our error bars are larger than our mean
velocity (3 ± 8 km s⁻¹). This is a drastic difference from the
steady-state predictions.
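
The 16% and 84% figures follow from the one-sided tail of a Gaussian at one standard deviation above the mean. The Python sketch below illustrates this under the Gaussian approximation stated in the text; the function name is hypothetical.

```python
import math


def gaussian_upper_tail(threshold, mean, sigma):
    """Probability that a Gaussian variable exceeds `threshold`."""
    z = (threshold - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Five-roll result <V_N> = +3 +/- 11 km/s; threshold of +14 km/s is mean + 1 sigma.
p_exceeds = gaussian_upper_tail(14.0, mean=3.0, sigma=11.0)
p_below = 1.0 - p_exceeds
print(f"P(<V_N> > 14 km/s) = {p_exceeds:.3f}")
print(f"P(<V_N> < 14 km/s) = {p_below:.3f}")
```

The threshold of +14 km s⁻¹ sits exactly one standard deviation above the mean, which is why the probabilities are about 16% and 84%.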

We therefore conclude from our values (3 km s⁻¹ ≪ 25 km s⁻¹) that
Voyager 1 is not at present close to the heliopause, at least in the form
that it has been envisioned up to now. In fact, it has been in the transition
region of weak radial (V_R) flow for over two years now (Fig. 3j), during
which time it travelled an additional 7.5 AU outwards from the Sun. We
do not know how much farther outwards the transition region extends,
and the longer it lasts in time, the less likely it is to be dominated by a
temporal effect of the expansion and contraction of the heliopause
during the 11-year solar activity cycle. However, a non-stationary solar
wind should be included in any realistic model. In any case, any theories
that predict a meridional flow velocity significantly outside of the
Voyager 1 statistical limits (−8 km s⁻¹ < ⟨V_N⟩ < 14 km s⁻¹) should be
reassessed, perhaps necessitating a new theoretical formulation of the
interaction of the solar wind with the local interstellar medium.
Flexible metal-oxide devices made by room-temperature
photochemical activation of sol-gel films
Yong-Young Noh & Sung Kyu Park


Amorphous metal-oxide semiconductors have emerged as poten- 
tial replacements for organic and silicon materials in thin-film 
electronics. The high carrier mobility in the amorphous state, 
and excellent large-area uniformity, have extended their applica- 
tions to active-matrix electronics, including displays, sensor arrays 
and X-ray detectors. Moreover, their solution processability and
optical transparency have opened new horizons for low-cost printable
and transparent electronics on plastic substrates. But metal-oxide
formation by the sol-gel route requires an annealing step at
relatively high temperature, which has prevented the incorporation
of these materials with the polymer substrates used in
high-performance flexible electronics. Here we report a general method
for forming high-performance and operationally stable metal-oxide 
semiconductors at room temperature, by deep-ultraviolet photo- 
chemical activation of sol-gel films. Deep-ultraviolet irradiation 
induces efficient condensation and densification of oxide semicon- 
ducting films by photochemical activation at low temperature. This 
photochemical activation is applicable to numerous metal-oxide 
semiconductors, and the performance (in terms of transistor mobility 
and operational stability) of thin-film transistors fabricated by this 
route compares favourably with that of thin-film transistors based 
on thermally annealed materials. The field-effect mobilities of the
photo-activated metal-oxide semiconductors are as high as 14 and
7 cm² V⁻¹ s⁻¹ (with an Al₂O₃ gate insulator) on glass and polymer
substrates, respectively; and seven-stage ring oscillators fabricated
on polymer substrates operate with an oscillation frequency of more
than 340 kHz, corresponding to a propagation delay of less than
210 nanoseconds per stage.

During recent decades, solution-processed organic and inorganic 
semiconductors have been intensively investigated for realizing
large-area flexible and printed electronics by continuous-solution
processes. Nevertheless, organic semiconductors still suffer from
operational instability, and have relatively low carrier mobility for 
high-end applications. Some inorganic materials are too reactive to 
control in ambient conditions, and thus have had limited scope for 
large-scale fabrication. Recently, amorphous or polycrystalline metal- 
oxide semiconductors have been proposed as alternative channel 
materials, because they exhibit excellent optical transparency and good 
thin-film transistor (TFT) performance in ambient conditions.
Wet chemical, ‘sol-gel’ methods can be used to form high-quality 
oxide films, but such methods typically require a high-temperature 
annealing step, which is not compatible with conventional polymer 
substrates. Thus, for the full realization of flexible, large-scale, solu- 
tion-processed metal-oxide electronics, it is important to understand 
the chemistry involved in sol-gel metal-oxide formation, and to apply 
this knowledge to the low-temperature synthesis of metal-oxide semi- 
conducting films that are compatible with flexible polymer substrates 
and open-chamber, continuous processes. 


We have developed a new photo-annealing method for forming 
amorphous metal-oxide semiconductors, and have examined its viability 
for producing large-area uniform devices and integrated circuits on 
polymer substrates. We use photochemical activation induced by 
deep-ultraviolet (DUV) light from a low-pressure mercury lamp in an 
inert atmosphere (to prevent reactive ozone formation) to achieve high 
degrees of sol-gel condensation and film densification in amorphous 
metal-oxide semiconductor systems including indium gallium zinc 
oxide (IGZO), indium zinc oxide (IZO) and indium oxide (In₂O₃).
Our results suggest that DUV-assisted metal-oxide formation is a 
general route to prepare high-performance, solution-processed 
metal-oxide semiconductor films with only small amounts of extra 
heat supplied, permitting the use of thermally sensitive substrate 
materials. 

To explain the formation of high-quality sol-gel semiconductor films 
by DUV irradiation, we propose the following mechanism, based on 
experimental data from ultraviolet-visible absorption spectroscopy, 
X-ray photoelectron spectroscopy, high-resolution transmission elec- 
tron microscopy (HRTEM), Rutherford backscattering spectrometry 
and ellipsometry (Fig. 1 and Supplementary Fig. 1). When metal 
precursors for IGZO films are dissolved in 2-methoxyethanol 
(2-ME), and the resultant precursor solution is stirred at 75 °C for more 
than 12h, a ligand exchange reaction occurs from nitrate/acetate to 
2-methoxyethoxide or hydroxide, and condensation of metal alkox- 
ides/hydroxides proceeds to form a partial network of metal-oxygen- 
metal (M-O-M) bonds in the solution. The as-spun films (25-35 nm 
thick) before DUV irradiation still contain a significant amount of 
residual organic components, as confirmed by a high carbon content 
in the film (Fig. 1b). Subsequently, when the as-spun film is exposed to 
DUV irradiation from the mercury lamp (main peaks at 184.9 nm
(10%) and 253.7 nm (90%)) under nitrogen purging, high-energy
DUV photons induce photochemical cleavage of alkoxy groups, and 
activate metal and oxygen atoms to facilitate M—O-M network forma- 
tion (Fig. 1a, step 1, condensation). The efficiency of these DUV-assisted 
initial cleavage and condensation reactions is indicated by the rapid 
decrease of oxygen and carbon contents in the first 30 min of irra- 
diation. Further irradiation induces a gradual removal of oxygen and 
carbon (and, thereby, near-complete condensation) and a transition to 
film densification (step 2, densification). 

The degree of film densification is confirmed by comparing the 
areal densities and thicknesses of photo-annealed (P) and
high-temperature, thermally annealed (T) IGZO films: 52.88 × 10¹⁵ (P)
versus 52.43 × 10¹⁵ (T) atoms cm⁻² from Rutherford backscattering
spectrometry; and 7.1-9.70 (P) versus 7.1-10.26 (T) nm from
HRTEM (lower limit) and ellipsometry (upper limit) measurements
(Supplementary Fig. 1). Also, the atomic binding states, such as M-O 
bonding, in the photo-annealed film are similar to those in the thermally 
annealed film (Fig. 1c and Supplementary Fig. 2). We speculate that 
Figure 1 | Photo-activation of solution-processed metal-oxide
semiconductors by DUV. a, Schemes showing condensation mechanism of 
metal-oxide precursors by DUV irradiation (hv). Light-blue shading denotes 
illumination from the low-pressure mercury lamp (blue cylinders). b, Atomic 
composition ratios of IGZO thin films as a function of DUV irradiation time. 
c, Red curves are X-ray photoelectron spectra (O(1s) peak) of as-spun, photo- 
annealed and thermally annealed IGZO films. (Deconvolution of the spectra 
shows the contributions of peaks at ~530.0 (green), ~531.0 (blue) and 


such a high-degree of densification after 60 min is enabled by decom- 
position of organic residues (solvent molecules and residual alkoxy 
groups) by DUV-assisted photolysis and reorganization of M-O-M 
networks. The latter process is promoted by photochemical cleavage 
and rearrangement of disordered M-O-M networks without high- 
temperature annealing””**. 

We have discovered that the DUV irradiation in our setup is accom- 
panied by unintentional heating of the films up to ~150 °C (from the 
radiant heat of the lamp), and this temperature is maintained even after 
prolonged DUV irradiation (>120 min, >180-201 J cm⁻²; Supplementary
Fig. 3a). For comparison, metal-oxide films annealed at 150 °C
without DUV treatment, or cooled on a cooling stage (40-70 °C) with 
DUV irradiation, showed almost no or low electrical performance, 
respectively (Fig. 2b and Supplementary Fig. 3d). All of these observa- 
tions imply that near-complete condensation and densification of films 
requires both DUV photo-activation and the unintentional moderate 
heating. We suppose that this moderate heating provides extra thermal 
energy for the removal of volatile organic residues (2-ME has a boiling 
point of 124°C), and for M-O-M network reorganization via efficient 
condensation and subsequent densification. Additionally, the measured 
optical transmittance and band gap of photo-annealed oxide films are 
very close to those of oxide films annealed at 350 °C (Fig. 1d, inset), and 
there is no apparent indication of DUV-induced metallic reduction.

Figure 1e shows ultraviolet-visible absorption spectra of precursor
solutions for IGZO film preparation. For comparison, the absorption 
spectra of neat solvent (2-ME) and individual metal (In, Ga, Zn) pre- 
cursor solutions are also shown. Unlike 2-ME, which shows minimal 
absorption at wavelengths of 225-350nm, the solutions of 




~532.0 eV (purple) from, respectively: oxygen atoms in M-O-M lattice; 
oxygen atoms near oxygen vacancies and in M-OC bonds; and oxygen atoms in 
M-OH compounds.) d, Optical transmittance (main plot) and bandgap 
(insets) of thermally annealed and photo-annealed IGZO films on glass 
substrates. As shown by the tangent lines in the insets, the bandgaps of the 
photo-annealed and thermally annealed films are 3.23 and 3.22 eV, 
respectively. α, absorption coefficient. e, Light absorption characteristics of
2-ME, IGZO solution, and metal precursor solutions. a.u., arbitrary units. 


In(NO₃)₃·xH₂O, Ga(NO₃)₃·xH₂O and Zn(CH₃CO₂)₂·2H₂O in
2-ME exhibit strong light absorption below 260, 250 and 230 nm,
respectively. As the mercury lamp has two main emission peaks at 
253.7 and 184.9 nm, the photochemical activation of indium, gallium 
and zinc precursor molecules can be facilitated by DUV irradiation 
from the lamp. 

Following successful application of the DUV photo-annealing 
method to IGZO thin films, we investigated its applicability to sol- 
gel films of other binary, ternary and quaternary oxide systems such as 
In₂O₃, IZO, zinc tin oxide (ZTO), and indium zinc tin oxide (IZTO).
From preliminary tests with a simplified device architecture (on an 
SiO,/Si wafer without channel isolation), we have concluded that the 
photo-annealing method can be applied to solution-based oxide systems
except those using ZnCl₂ solution. This exception can be ascribed
to the negligible DUV absorption in ZnCl₂ solution, leading to inefficient
photochemical activation, cleavages and energy transfer by the
high-energy DUV photons (Supplementary Fig. 4). Note that, whereas 
the IGZO TFTs photo-annealed in an N₂ atmosphere have shown
excellent device characteristics, the performance of the devices
photo-annealed in air is rather poor and unstable, despite an increase
in substrate temperature to 180 °C (possibly due to the absence of N₂
purging; Supplementary Fig. 5). In air, the photo-activation efficiency
by 184.9-nm emission from the mercury lamp is significantly attenuated,
mainly owing to absorption by molecular oxygen (O₂). This causes
insufficient photochemical cleavage of metal alkoxides and poor densi- 
fication of the resultant film, leading to inactive TFT operation. 

For further investigation of the effectiveness of DUV photo- 
activation for general oxide semiconductor preparation, we fabricated 
Figure 2 | Transfer characteristics of photo-annealed IGZO, IZO and In₂O₃
TFTs using Al₂O₃ and SiO₂ gate dielectric, and comparison with thermally
annealed devices. a, Transfer characteristics and saturation mobility
distribution of photo-annealed IGZO, IZO and In₂O₃ TFTs fabricated on glass
with Al₂O₃ gate dielectric (~20 devices). The source-drain voltage, V_D, is 10 V
in all cases. In top panels, red curves are drain current; blue curves are (drain


both thermally annealed and photo-annealed IGZO, IZO and In₂O₃
TFTs, and compared their performance. For the channel layer, the
as-spun sol-gel films were photo-annealed in an N₂ atmosphere for
current)^(1/2); and grey points are gate leakage current (I_G; values on left-hand axis).
b, Transfer and output characteristics of photo-annealed and thermally annealed
(150 and 350 °C) IGZO TFTs fabricated on SiO₂ (200 nm)/Si wafers. Red and
blue curves as in a. The channel lengths and widths of all measured devices are
10 µm and 100 µm, respectively. In bottom panels, the seven curves are for
source-gate voltages, V_GS, ranging from 0 to 30 V (bottom to top), in 5-V steps.
90-120 min, corresponding to an irradiation dose of 135-201 J cm⁻².
Interestingly, despite the very small and gradual change in the atomic 
compositions after 60min (Fig. 1b), the transistor mobilities of the 
formed films increase substantially after 90 min, with the best electrical 
properties and spatial uniformity achieved between 90 and 120 min of 
DUV photo-annealing (Supplementary Figs 5c and 6). These two dis- 
tinct trends in atomic composition and electrical properties versus 
photo-activation time suggest that there are two separate stages of 
photo-activation: first, rapid chemical condensation, followed by gradual 
structural rearrangement and densification. The requirement for pro- 
longed DUV exposure, with its accompanying moderate heating effect, 
may facilitate the second stage of photo-activation. Note that there is a 
distribution of optimal photo-activation times (Supplementary Figs 5c 
and 6), possibly due to uneven light intensity and/or power fluctuation of 
the mercury lamp currently installed in our photo-annealing apparatus. 
Figure 2a shows the transfer characteristics of photo-annealed oxide 
TFTs with channel length and width of 10 and 100 µm, respectively, and
with 35-nm-thick atomic-layer-deposited Al₂O₃ as a gate dielectric
(138 nF cm⁻²) on glass substrates. The photo-annealed TFTs have
shown field-effect mobilities of 8.76 ± 0.98 cm² V⁻¹ s⁻¹ for IGZO,
4.43 ± 0.59 cm² V⁻¹ s⁻¹ for IZO, and 11.29 ± 1.62 cm² V⁻¹ s⁻¹ for
In₂O₃. Compared with TFTs annealed at 350 °C, the photo-annealed


Figure 3 | Electrical characteristics and bias stability of photo-annealed 
IGZO TFTs on flexible substrates. a, Transfer and output characteristics of a 
photo-annealed IGZO TFT fabricated on a PAR substrate. Left panel: curves 
and grey points as in Fig. 2. Right panel: curves are for V_GS ranging from 0 to
10 V (bottom to top), in 2-V steps. Channel length and width are 10 µm and
100 µm, respectively. b, Distribution of saturation mobilities of photo-annealed
IGZO TFTs on PAR (49 devices). c, Threshold voltage shift, ΔV_T, of IGZO
TFTs under positive gate-bias stress (V_GS = +5 V, V_DS = +0.1 V). Glass
substrates are unpassivated; PAR substrates are either unpassivated (green 
curve) or passivated with poly(methylmethacrylate) (blue curve). 
Figure 4 | Characteristics of seven-stage ring oscillators fabricated on a PAR
substrate by photo-annealing. a, Optical micrographs and a schematic
cross-section of photo-annealed IGZO TFTs and circuits on PAR. b, Optical
micrograph of a seven-stage ring oscillator, with a β-ratio of 2 (see text for
details of channel width/length ratios). Gate to source/drain overlap distance is 


devices exhibit comparable or enhanced mobilities (Fig. 2a and 
Supplementary Fig. 7). Figure 2b shows the transfer and output 
characteristics of the room-temperature photo-annealed and high- 
temperature thermally annealed IGZO TFTs using thermally grown 
SiO₂ (200 nm) as a gate dielectric. The photo-annealed TFTs have
shown field-effect mobilities as high as 2.64 cm² V⁻¹ s⁻¹ (Supplementary
Fig. 6b), which are also comparable to those of devices thermally annealed
at high temperature (350-500 °C). We speculate that the
different semiconductor mobilities on Al₂O₃ and SiO₂ gate dielectrics
may result from different values of the effective gate electric field (related
to the gate insulator capacitance and applied gate bias), and the
semiconductor-dielectric interface effect. Nonetheless, these results
show that photo-annealing is an alternative route to high-performance 
semiconductors based on solution-processed metal oxide films, even at 
room temperature. 

To take full advantage of low-temperature photo-activation of 
metal-oxide semiconductors, we fabricated TFTs and circuits based
on a solution-processed and photo-activated oxide semiconductor 
directly on commercially available polyarylate (PAR) film. The DUV 
irradiation induces a slight yellowing of the PAR substrate surface 
(optical transmission loss by 5-10%), but this DUV-induced coloura- 
tion does not propagate beyond the topmost surface of PAR substrates, 
and the mechanical integrity of the film is minimally affected (Sup- 
plementary Fig. 9). Figure 3a, b shows typical device characteristics for 
TFTs made from photo-annealed IGZO on PAR substrates. The 
measured field-effect mobilities are centred at 3.77 cm² V⁻¹ s⁻¹
(maximum value of ~7 cm² V⁻¹ s⁻¹) with a narrow distribution
(standard deviation of 1.02 cm² V⁻¹ s⁻¹, from 49 devices). Also, the
devices show excellent current on/off modulation, sub-threshold swing, 
and threshold voltage (Vr) values of 10°, 95.8 + 20.8 mV per decade, 
and 2.70 + 0.47 V, respectively. 

We performed positive-gate-bias stress tests to verify the opera- 
tional stability of photo-annealed IGZO TFTs in air, in dark conditions 
(Fig. 3c). Even without device packaging or passivation, the photo- 
annealed IGZO TFTs fabricated on glass substrates reveal outstanding 
operational stability, with a very small V_T shift (ΔV_T) of 1.12 V after a
gate-bias stress time of 10,000 s (Supplementary Fig. 10). Note that the
gate-bias stability is comparable to that of devices annealed at 350 °C
under identical stress conditions (ΔV_T of 0.86 V), and exceptionally
low compared with that of previously reported devices based on 
5 µm. c, Oscillation frequency (red) and per-stage propagation delay (blue) of
seven-stage ring oscillator as a function of supply voltage, V_DD. d, Output
waveforms of the seven-stage ring oscillator operating with supply voltages of 
5 V (left panel) and 15 V (right panel), and oscillation frequencies of 45 and 
341 kHz, respectively. 


solution-processed metal-oxide semiconductors. In the case
of photo-annealed IGZO TFTs on PAR substrates, the unpassivated
and poly(methyl methacrylate)-passivated (~300-nm-thick) devices
exhibit ΔV_T values of 4.5 and 3 V, respectively. More stable TFT
characteristics on glass substrates may be attributed to the presence of
fewer interfacial trap states at the interface between semiconductor 
and gate dielectric, possibly as a result of the low surface roughness 
of the dielectric layer (Supplementary Fig. 11). To demonstrate
device scalability, we fabricated seven-stage ring oscillator circuits on 
the PAR substrates (Fig. 4a, b). The room-temperature-fabricated 
IGZO TFTs on polymer substrates are typically enhancement-mode 
devices, and allow simple digital logic circuits without level shifting. 
The inverter in the ring oscillator had a β-ratio of 2 (channel
width-to-length ratios (W/L)drive = 100 µm/7 µm and (W/L)load = 50 µm/7 µm),
with an overlap distance of 5 µm between the gate and
source/drain electrodes. With a supply voltage of V_DD = 15 V, we
measured an oscillation frequency greater than ~340 kHz, and a
corresponding propagation delay less than ~210 ns per stage (Fig. 4c, d).
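
The per-stage delay follows from the oscillation frequency through the standard ring-oscillator relation t_pd = 1/(2Nf), where N is the number of stages. The Python sketch below is an illustration using the frequencies quoted for Fig. 4d; the function name is hypothetical.

```python
def stage_delay_s(frequency_hz, n_stages):
    """Per-stage propagation delay of an N-stage ring oscillator.

    One oscillation period is two traversals of the ring (a rising and a
    falling transition through every stage), so t_pd = 1 / (2 * N * f).
    """
    return 1.0 / (2.0 * n_stages * frequency_hz)


N_STAGES = 7
delay_15v = stage_delay_s(341e3, N_STAGES)  # supply voltage of 15 V
delay_5v = stage_delay_s(45e3, N_STAGES)    # supply voltage of 5 V
print(f"15 V: {delay_15v * 1e9:.1f} ns per stage")
print(f" 5 V: {delay_5v * 1e6:.2f} us per stage")
```

At 341 kHz this gives a delay just under 210 ns per stage, consistent with the value quoted in the text.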

We propose that DUV-assisted photochemistry approaches can 
open a new route for achieving high-performance, flexible and printed,
metal-oxide thin-film electronic devices. Translation of this photo- 
annealing process to industrial applications may be helped by modi- 
fying the sol-gel solutions to include DUV-decomposable additives 
(fuels) and solvents, as well as by increasing the DUV energy density to 
boost the DUV-assisted photo-activation. 


METHODS SUMMARY 


We prepared solutions for IGZO, IZO and In₂O₃ by dissolving indium nitrate
hydrate, gallium nitrate hydrate and zinc acetate dihydrate in 2-ME (Supplementary
Fig. 12). DUV photo-annealing was conducted by placing the as-spun films
under a high-density DUV treatment system (UV253H, Filgen) under N₂ purging
(film spacing 1-5 cm, 25-28 mW cm⁻²). The light source is a low-pressure
mercury lamp with two main emission peaks at 253.7 nm (90%) and 184.9 nm 
(10%). 

For the fabrication of photo-annealed metal-oxide TFTs on glass substrates, we 
used 0.7-mm-thick glass (Eagle 2000, Samsung Corning Precision Glass). Gate 
electrodes were defined by patterning Ti/Au (3 nm/80 nm) or Mo (100 nm) layers. 
Gate dielectrics were 35-nm-thick Al₂O₃, deposited by atomic layer deposition at
100 °C over the gate-patterned substrates. For the channel layer, oxide solutions
were spin-coated and photo-annealed in N₂ atmosphere. After subsequent
patterning of the channel layer by wet etching, via holes and 100-nm-thick IZO 
source/drain electrodes were fabricated. For reference devices, spin-coated IGZO,
IZO and In₂O₃ films were thermally annealed at 350 °C for 60 min on a hot
plate in air. 

For the fabrication of photo-annealed metal-oxide TFTs on polymer substrates, 
we used 200-µm-thick PAR films (A200HC, Ferrania Technologies), which have
good dimensional stability. Other TFT fabrication processes are identical to those 
on glass substrates. 


Full Methods and any associated references are available in the online version of 
the paper. 
METHODS


Solutions for IGZO, IZO, IZTO and In₂O₃ were prepared by the following
procedure. Metal precursors comprising powders of indium nitrate hydrate
(In(NO₃)₃·xH₂O), gallium nitrate hydrate (Ga(NO₃)₃·xH₂O), zinc acetate
dihydrate (Zn(CH₃CO₂)₂·2H₂O), zinc chloride (ZnCl₂), tin acetate
(Sn(CH₃CO₂)₄) and tin chloride (SnCl₂) (all from Sigma-Aldrich) were dissolved
in 2-ME (anhydrous, Sigma-Aldrich). After dissolving the precursors in the
solvent, the solutions were thoroughly stirred for more than 12 h at 75 °C. A
solution for ZTO was prepared as follows. ZnCl₂ and SnCl₂ powders were dissolved
in acetonitrile (anhydrous, Sigma-Aldrich) with Zn:Sn molar concentrations
of 0.07 M:0.07 M. After dissolving the precursors in the solvent, the solution
was stirred for 15 min at room temperature. The optical absorption characteristics 
of precursor solutions were analysed by an ultraviolet-visible spectrophotometer 
(V-560, JASCO) in the wavelength range 190-500 nm. Each solution was placed in 
a quartz cuvette after dissolution of the precursors. 

Light-assisted photochemical activation and film characterization. The light- 
assisted photochemistry was performed by a high-density ultraviolet treatment 
system with a low-pressure Hg lamp (emission wavelengths of 253.7 nm (90%)
and 184.9 nm (10%); area of 20 × 20 cm²; UV253H, Filgen) in N₂-purging conditions.
The output energy intensity of the lamp was ~25-28 mW cm⁻², and
varied slightly with measurement position. The corresponding flux density of
photons is 2.88-3.22 × 10²⁰ m⁻² s⁻¹ (λ = 253.7 nm, 90% of total power density)
and 2.32-2.6 × 10¹⁹ m⁻² s⁻¹ (λ = 184.9 nm, 10% of total power density). The total
energy delivered to the sample surface is calculated to be 135-151 J cm⁻² and
180-201 J cm⁻² for 90 min and 120 min of irradiation, respectively. The as-spun
samples were placed under the DUV lamp at ~1-5 cm spacing. N₂ gas was continuously
supplied to prevent formation of ozone, and create an inert gas atmosphere
inside the chamber that would allow transmission of DUV (especially the
184.9 nm wavelength) without significant attenuation. The DUV irradiation time 
was controlled in the range 30-120 min for photochemical reactions. The radiant 
thermal energy of the mercury lamp increased the surface temperature of the 
substrate to 130-150°C, and this temperature was maintained during the 
photo-annealing process. The surface temperature of the substrate was measured 
by an infrared camera (InfraCAM, FLIR System). 
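
The photon flux densities and irradiation doses quoted above follow from the lamp power density, the spectral split between the two emission lines, and the exposure time. The Python sketch below is an illustrative check rather than the authors' calculation; the function names are hypothetical, and the physical constants are the exact SI values of h and c.

```python
PLANCK_J_S = 6.62607015e-34     # Planck constant, J s
LIGHT_SPEED_M_S = 299_792_458   # speed of light, m/s


def photon_flux_m2_s(power_mw_cm2, line_fraction, wavelength_nm):
    """Photon flux density (photons m^-2 s^-1) carried by one emission line."""
    power_w_m2 = power_mw_cm2 * 10.0  # 1 mW cm^-2 equals 10 W m^-2
    photon_energy_j = PLANCK_J_S * LIGHT_SPEED_M_S / (wavelength_nm * 1e-9)
    return line_fraction * power_w_m2 / photon_energy_j


def dose_j_cm2(power_mw_cm2, minutes):
    """Total energy per unit area delivered over the exposure time."""
    return power_mw_cm2 * 1e-3 * minutes * 60.0


for power in (25.0, 28.0):
    flux_254 = photon_flux_m2_s(power, 0.90, 253.7)
    flux_185 = photon_flux_m2_s(power, 0.10, 184.9)
    print(f"{power:.0f} mW/cm2: {flux_254:.2e} (253.7 nm), {flux_185:.2e} (184.9 nm)")
    print(f"  dose: {dose_j_cm2(power, 90):.0f} J/cm2 (90 min), "
          f"{dose_j_cm2(power, 120):.0f} J/cm2 (120 min)")
```

The results agree with the quoted ranges to within rounding: fluxes of order 10²⁰ and 10¹⁹ m⁻² s⁻¹ for the two lines, and doses of 135-151 and 180-202 J cm⁻² for 90 and 120 min.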

The X-ray photoelectron spectra were analysed by Escalab 220i-XL Thermo VG 
Scientific, using a monochromated Al Kα source at 1486.6 eV with a base pressure
of 7.8 × 10⁻¹⁰ mbar. For each sample, Ar ion etching was carried out before the
analysis. The Rutherford backscattering measurements were performed with He⁺
particles delivered by a 450-keV vertical accelerator (HRBS V500, KOBELCO).
The HRTEM images were obtained by JEM-3010 (JEOL) using a 300-kV
transmission electron microscope with a LaB6 electron source, and the samples were
prepared by Ar⁺ ion milling (Model 1010, Fischione Instruments) after
mechanical polishing.

Transistor and circuit fabrication. For the fabrication of solution-processed 
metal-oxide TFTs on glass substrates, 0.7-mm-thick glass substrates (Eagle 
2000, Samsung Corning Precision Glass) were used. As a gate electrode, 
thermally evaporated Au (80 nm) with a 3-nm-thick Ti adhesion layer, or
sputter-deposited Mo (100 nm), was patterned by a standard photolithography
process and wet etching. On the gate electrode, a 35-nm-thick Al2O3 gate dielectric
layer was deposited by atomic layer deposition at 100°C using trimethyl
aluminium. For the channel layer, oxide solutions were spin-coated (25-35 nm
thick) and photo-annealed in a N2 atmosphere for 90-120 min. After patterning
the channel layer (7-10 nm thick) by photolithography and wet etching, via holes 
were etched and finally IZO source/drain electrodes (100 nm thick) were deposited 
and patterned by a lift-off process. The wet etching of the IGZO layer was carried 
out by LCE-12K (an ITO etchant) from Cyantek Corporation. For reference 
devices, the spin-coated IGZO, IZO and In2O3 films were first baked at 200°C
for 10 min, then annealed at 350 °C for 60 min on a hot plate in air. In the case of 
ZTO and IZTO films, the spin-coated films were baked at 200 °C for 10 min and 
annealed at 500 °C for 10 min by a rapid thermal annealing system’. 

For the fabrication of solution-processed metal-oxide TFTs on polymer
substrates, 200-µm-thick PAR films (A200HC, Ferrania Technologies) were used
because of their dimensional stability following chemical treatment. As a gate elec- 
trode, thermally evaporated Au (80 nm) with a 3-nm-thick Ti adhesion layer, or 
sputtered Mo (100 nm) was patterned by a standard photolithography process and 
wet etching. On the gate electrode, a 35-nm-thick Al2O3 gate dielectric layer was
deposited by atomic layer deposition at 100 °C using trimethyl aluminium. For the
channel layer, precursor solutions were spin-coated and photo-annealed in a N2
atmosphere for 90-120 min. After patterning the channel layer by photolithography 
and wet etching, via holes were etched and finally IZO source/drain electrodes were 
deposited and patterned by a lift-off process. Finally, some devices were prepared 
with polymer passivation (encapsulation) on the channel area, for the comparison of 
operational stability. The passivation (encapsulation) process was carried out with 
poly(methyl methacrylate) (PMMA, MicroChem C4 or A4). The PMMA was spun
over the source/drain electrodes and channel areas and annealed at 150°C for 
10 min. The final thickness of the PMMA layer was ~300 nm. 

Highly stretchable and tough hydrogels

Hydrogels are used as scaffolds for tissue engineering’, vehicles
for drug delivery’, actuators for optics and fluidics’, and model 
extracellular matrices for biological studies*. The scope of hydrogel 
applications, however, is often severely limited by their mechanical 
behaviour®. Most hydrogels do not exhibit high stretchability; 
for example, an alginate hydrogel ruptures when stretched to about 
1.2 times its original length. Some synthetic elastic hydrogels®” 
have achieved stretches in the range 10-20, but these values are 
markedly reduced in samples containing notches. Most hydrogels 
are brittle, with fracture energies of about 10 J m⁻² (ref. 8), as
compared with ~1,000 J m⁻² for cartilage’ and ~10,000 J m⁻² for
natural rubbers’. Intense efforts are devoted to synthesizing
hydrogels with improved mechanical properties'’”*; certain
synthetic gels have reached fracture energies of 100-1,000 J m⁻²
(refs 11, 14, 17). Here we report the synthesis of hydrogels from 
polymers forming ionically and covalently crosslinked networks. 
Although such gels contain ~90% water, they can be stretched 
beyond 20 times their initial length, and have fracture energies of 
~9,000 J m⁻². Even for samples containing notches, a stretch of 17 is
demonstrated. We attribute the gels’ toughness to the synergy of 
two mechanisms: crack bridging by the network of covalent 
crosslinks, and hysteresis by unzipping the network of ionic cross- 
links. Furthermore, the network of covalent crosslinks preserves the 
memory of the initial state, so that much of the large deformation is 
removed on unloading. The unzipped ionic crosslinks cause 
internal damage, which heals by re-zipping. These gels may serve 
as model systems to explore mechanisms of deformation and energy 
dissipation, and expand the scope of hydrogel applications. 
Certain synthetic hydrogels exhibit exceptional mechanical 
behaviour. A hydrogel containing slide-ring polymers can be stretched 
to more than 10 times its initial length®; a tetra-poly(ethylene glycol) 
gel has a strength of ~2.6 MPa (ref. 7). These gels deform elastically. 
An elastic gel is known to be brittle and notch-sensitive; that is, the 
stretchability and strength decrease markedly when samples contain 
notches, or any other features that cause inhomogeneous deforma- 
tion’. A gel can be made tough and notch-insensitive by introducing 
energy-dissipating mechanisms. For example, a fracture energy of 
~1,000 J m⁻² is achieved with a double-network gel, in which two
networks—one with short chains, and the other with long chains— 
are separately crosslinked by covalent bonds’. When the gel is 
stretched, the short-chain network ruptures and dissipates energy””. 
But the rupture of the short-chain network causes permanent damage. 
After the first loading, the gel does not recover from this damage; thus, 
on subsequent loadings, the fracture energy is much reduced*'. To 
enable recoverable energy-dissipating mechanisms, several recent 
works have replaced the sacrificial covalent bonds with non-covalent 
bonds. In a gel with a copolymer of triblock chains, for example, the 
end blocks of different chains form glassy domains, and the midblocks 
of different chains form ionic crosslinks”. When the gel is stretched, 
the glassy domains remain intact, while the ionic crosslinks break and 
dissipate energy. The ionic crosslinks then re-form during a time
interval after the first loading’. Recoverable energy dissipation can
also be effected by hydrophobic associations'”"*. When a gel made with 
hydrophobic bilayers in a hydrophilic polymer network is stretched, 
the bilayers dissociate and dissipate energy; on unloading, the bilayers 
re-assemble, leading to recovery'’. However, previous studies along 
these lines have demonstrated fracture energy comparable to, or lower 
than, that of the double-network gels. 

We have synthesized extremely stretchable and tough hydrogels by 
mixing two types of crosslinked polymer: ionically crosslinked alginate, 
and covalently crosslinked polyacrylamide (Fig. 1). An alginate chain 
comprises mannuronic acid (M unit) and guluronic acid (G unit), 
arranged in blocks rich in G units, blocks rich in M units, and blocks 
of alternating G and M units. In an aqueous solution, the G blocks in 
different alginate chains form ionic crosslinks through divalent cations 
(for example, Ca²⁺), resulting in a network in water: an alginate
hydrogel. By contrast, in a polyacrylamide hydrogel, the polyacrylamide 
chains form a network by covalent crosslinks. We dissolved powders 
of alginate and acrylamide in deionized water. (Unless otherwise 
stated, the water content was fixed at 86 wt %.) We added ammonium 
persulphate as a photo-initiator for polyacrylamide, and N,N- 
methylenebisacrylamide as the crosslinker for polyacrylamide. After 
degassing the solution in a vacuum chamber, we added
N,N,N′,N′-tetramethylethylenediamine, at 0.0025 the weight of acrylamide, as the
crosslinking accelerator for polyacrylamide, and calcium sulphate
slurry (CaSO4·2H2O) as the ionic crosslinker for alginate. We poured
the solution into a glass mould measuring 75.0 × 150.0 × 3.0 mm³,
covered with a 3-mm-thick glass plate. The gel was cured in one step 
with ultraviolet light for 1 hour (with 8 W power and 254 nm wave- 
length at 50 °C), and was then left in a humid box for 1 day to stabilize 
the reactions. After the curing step, we took the gel out of the humid 
box, and removed water on its surfaces using N2 gas for 1 minute.
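
The weight ratios given for this synthesis (here and in the Fig. 2 legend) can be turned into a batch recipe. The sketch below is one possible reading, not the authors' protocol: it assumes that the 86 wt % water content refers to water as a fraction of water plus alginate plus acrylamide, and that the crosslinkers and accelerator are added on top of that mass. The function name and the 1:8 default ratio are illustrative choices.

```python
def hybrid_gel_recipe(batch_mass_g, water_fraction=0.86,
                      alginate_per_acrylamide=1.0 / 8.0,
                      mbaa_per_acrylamide=0.0006,
                      temed_per_acrylamide=0.0025,
                      caso4_per_alginate=0.1328):
    """Component masses (g) for a batch of water + alginate + acrylamide.

    Assumption: water_fraction is taken relative to the combined mass of
    water, alginate and acrylamide; the other additives are extra.
    """
    polymer_g = batch_mass_g * (1.0 - water_fraction)
    acrylamide_g = polymer_g / (1.0 + alginate_per_acrylamide)
    alginate_g = polymer_g - acrylamide_g
    return {
        "water": batch_mass_g * water_fraction,
        "acrylamide": acrylamide_g,
        "alginate": alginate_g,
        "MBAA": mbaa_per_acrylamide * acrylamide_g,
        "TEMED": temed_per_acrylamide * acrylamide_g,
        "CaSO4": caso4_per_alginate * alginate_g,
    }


if __name__ == "__main__":
    for name, grams in hybrid_gel_recipe(100.0).items():
        print(f"{name}: {grams:.4f} g")
```

The amount of ammonium persulphate is not stated in this section, so it is left out of the sketch.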

The gel was glued to two polystyrene clamps, resulting in specimens 
measuring 75.0 × 5.0 × 3.0 mm³. All mechanical tests were performed
in air, at room temperature, using a tensile machine with a 500-N load
cell. In both loading and unloading, the rate of stretch was kept
constant at 2 min⁻¹. We stretched an alginate-polyacrylamide hybrid gel
to >20 times its original length without rupture (Fig. 2a,b). The hybrid 
gel was also extremely notch-insensitive. When we cut a notch into the 
gel (Fig. 2c) and then pulled it to a stretch of 17, the notch was 
dramatically blunted and remained stable (Fig. 2d). At a critical 
applied stretch, a crack initiated at the front of the notch, and ran 
rapidly through the entire sample (Supplementary Movie 1). Large, 
recoverable deformation is demonstrated by dropping a metal ball on a
membrane of the gel fixed by circular clamps (Supplementary Movie 
2). On hitting the membrane, the ball stretched the membrane greatly 
and then bounced back. The membrane remained intact, vibrated, and 
recovered its initial flat configuration after the vibration was damped 
out. A ball with greater kinetic energy, however, caused the membrane 
to rupture after large deformation (Supplementary Movie 3). 

The extremely stretchable hybrid gels are even more remarkable
when compared with their parents, the alginate and polyacrylamide
gels (Fig. 3a). The amounts of alginate and acrylamide in the hybrid
gels were kept the same as those in the alginate gel and polyacrylamide
gel, respectively. When the stretch was small, the elastic modulus of the
hybrid gel was 29 kPa, which is close to the sum of the elastic moduli of
the alginate and polyacrylamide gels (17 kPa and 8 kPa, respectively).
The stress and stretch at rupture were, respectively, 156 kPa and 23 for
the hybrid gel, 3.7 kPa and 1.2 for the alginate gel, and 11 kPa and 6.6
for the polyacrylamide gel. Thus, the properties at rupture of the
hybrid gel far exceeded those of either of its parents.

Figure 1 | Schematics of three types of hydrogel. a, In an alginate gel, the G
blocks on different polymer chains form ionic crosslinks through Ca²⁺ (red
circles). b, In a polyacrylamide gel, the polymer chains form covalent crosslinks
through N,N-methylenebisacrylamide (MBAA; green squares). c, In an
alginate-polyacrylamide hybrid gel, the two types of polymer network are
intertwined, and joined by covalent crosslinks (blue triangles) between amine
groups on polyacrylamide chains and carboxyl groups on alginate chains.
Materials used were as follows: alginate (FMC Biopolymer, LF 20/40);
acrylamide (Sigma, A8887); ammonium persulphate (Sigma, A9164); MBAA
(Sigma, M7279); N,N,N′,N′-tetramethylethylenediamine (Sigma, T7024);
CaSO4·2H2O (Sigma, 31221); ultraviolet lamp (Hoefer, UVC 500).

Hybrid gels dissipate energy effectively, as shown by pronounced 
hysteresis. The area between the loading and unloading curves of a gel 
gives the energy dissipated per unit volume (Fig. 3b). The alginate gel 
exhibited pronounced hysteresis and retained significant permanent 
deformation after unloading. In contrast, the polyacrylamide gel 
showed negligible hysteresis, and the sample fully recovered its original 
length after unloading. The hybrid gel also showed pronounced 
hysteresis, but the permanent deformation after unloading was signifi- 
cantly smaller than that of the alginate gel. The pronounced hysteresis 
and relatively small permanent deformation of the hybrid gel were 
further demonstrated by loading several samples to large values of 
stretch before unloading (Fig. 3c). 
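
As stated above, the energy dissipated per unit volume is the area between the loading and unloading curves. A minimal sketch of that calculation follows, using trapezoidal integration of nominal stress over stretch. The curve data in the example are invented for illustration and are not measurements from this work; the function names are likewise hypothetical.

```python
def work_per_volume(stretch, stress_kpa):
    """Trapezoidal integral of nominal stress over stretch, in kJ m^-3.

    Points are taken in the order traversed, so an unloading branch
    (stretch decreasing) yields a negative value.
    """
    total = 0.0
    for i in range(1, len(stretch)):
        mean_stress = 0.5 * (stress_kpa[i] + stress_kpa[i - 1])
        total += mean_stress * (stretch[i] - stretch[i - 1])
    return total


def dissipated_energy(load_stretch, load_stress, unload_stretch, unload_stress):
    """Area enclosed by one loading-unloading cycle, in kJ m^-3."""
    return (work_per_volume(load_stretch, load_stress)
            + work_per_volume(unload_stretch, unload_stress))


if __name__ == "__main__":
    # Invented cycle: linear loading to a stretch of 2, then unloading
    # that reaches zero stress at a stretch of 1.5.
    loading = ([1.0, 2.0], [0.0, 100.0])
    unloading = ([2.0, 1.5, 1.0], [100.0, 0.0, 0.0])
    print(dissipated_energy(*loading, *unloading))
```

Because nominal stress is force per undeformed area and stretch is dimensionless, the integral has units of energy per unit undeformed volume (kPa corresponds to kJ m⁻³).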

Figure 2 | The hybrid gel is highly stretchable and notch-insensitive. a, A
strip of the undeformed gel was glued to two rigid clamps. b, The gel was
stretched to 21 times its initial length in a tensile machine (Instron model 3342).
The stretch, λ, is defined by the distance between the two clamps when the gel is
deformed, divided by the distance when the gel is undeformed. c, A notch was
cut into the gel, using a razor blade; a small stretch of 1.15 was used to make the
notch clearly visible. d, The gel containing the notch was stretched to 17 times
its initial length. The alginate/acrylamide ratio was 1:8. The weight of the
covalent crosslinker, MBAA, was fixed at 0.0006 that of acrylamide; the weight
of the ionic crosslinker, CaSO4, was fixed at 0.1328 that of alginate.

Figure 3 | Mechanical tests under various conditions. a, Stress-stretch
curves of the three types of gel, each stretched to rupture. The nominal stress, s,
is defined as the force applied on the deformed gel, divided by the
cross-sectional area of the undeformed gel. b, The gels were each loaded to a stretch of
1.2, just below the value that would rupture the alginate gel, and were then
unloaded. c, Samples of the hybrid gel were subjected to a cycle of loading and
unloading of varying maximum stretch. d, After the first cycle of loading and
unloading (red curve), one sample was reloaded immediately, and the other
sample was reloaded after 1 day (black curves, as labelled). e, Recovery of
samples stored at 80 °C for different durations, as labelled. f, The work of the
second loading, W2nd, normalized by that of the first loading, W1st, measured
for samples stored for different durations at different temperatures. The
alginate/acrylamide ratio was 1:8 for a and b, and 1:6 for c-f. Weights of
crosslinkers were fixed as described in Fig. 2 legend.

After the first loading and unloading, the hybrid gel was much
weaker if the second loading was applied immediately, and recovered
somewhat if the second loading was applied 1 day later (Fig. 3d and
Supplementary Fig. 1). We loaded a sample of the hybrid gel to a
stretch of 7, and then unloaded the gel to zero force. The sample was
then sealed in a polyethylene bag and submerged in mineral oil to
prevent water from evaporating, and stored in a fixed-temperature
bath for a prescribed time. The sample was then taken out of storage
and its stress-stretch curve was measured again at room temperature.
The internal damage was much better healed by storing the gel at an
elevated temperature for some time before reloading (Fig. 3e and
Supplementary Fig. 2). After storing at 80°C for 1 day, the work on
reloading was recovered to 74% of that of the first loading (Fig. 3f).

We prepared gels containing various proportions of alginate and
acrylamide to study why the hybrids were much more stretchable and
stronger than either of their parents. When the proportion of
acrylamide was increased, the elastic modulus of the hybrid gel
decreased (Fig. 4a). However, the critical stretch at rupture reached a
maximum at 89 wt % acrylamide. A similar trend was observed for
samples with notches (Fig. 4c). The fracture energy reached a
maximum value of 8,700 J m⁻² at 86 wt % acrylamide (Fig. 4d). The
densities of ionic and covalent crosslinks also strongly affect the
mechanical behaviour of the hybrid gels (Supplementary Figs 3, 4),
as well as that of pure alginate gels (Supplementary Fig. 5) and pure
polyacrylamide gels (Supplementary Fig. 6).

Our experimental findings provide insight into the mechanisms of
deformation and energy dissipation in these gels. When an unnotched
hybrid gel is subjected to a small stretch, the elastic modulus of the
hybrid gel is nearly the sum of those of the alginate and polyacrylamide
gels. This behaviour is also suggested by viscoelastic moduli
determined for the hybrid and pure gels (Supplementary Fig. 7). Thus, in
the hybrid gel the alginate and polyacrylamide chains both bear loads.
Moreover, alginate is finely and homogeneously dispersed in the
hybrid gel, as demonstrated by using fluorescent alginate and by
measuring local elastic modulus with atomic force microscopy
(Supplementary Fig. 8). The load sharing of the two networks may
be achieved by entanglements of the polymers, and by possible covalent
crosslinks formed between the amine groups on polyacrylamide
chains and the carboxyl groups on alginate chains (Fig. 1,
Supplementary Figs 9, 10). As the stretch increases, the alginate network
unzips progressively”, while the polyacrylamide network remains
intact, so that the hybrid gel exhibits pronounced hysteresis and little
permanent deformation. As only the ionic crosslinks are broken, and
the alginate chains themselves remain intact, the ionic crosslinks can
re-form, leading to the healing of the internal damage.

The giant fracture energy of the hybrid gel is remarkable, considering
that its parents (the alginate and polyacrylamide gels) have
fracture energies of 10-250 J m⁻² (Supplementary Figs 5, 6). The relatively
low fracture energy of a hydrogel comprising a single network
with covalent crosslinks is understood in terms of the Lake-Thomas
model®. When the gel contains a notch and is stretched, the
deformation is inhomogeneous; the network directly ahead of the notch is
stretched more than elsewhere (Supplementary Fig. 11). For the notch
to turn into a running crack, only the chains directly ahead of the notch
need to break. Once a chain breaks, the energy stored in the entire
chain is dissipated. In the ionically crosslinked alginate, fracture
proceeds by unzipping ionic crosslinks and pulling out chains. After one
pair of G blocks unzip, the high stress shifts to the neighbouring pair of
G blocks and causes them to unzip also (Supplementary Fig. 11). For
the notch in the alginate gel to turn into a running crack, only the
alginate chains crossing the crack plane need to unzip, leaving the
network elsewhere intact. In both polyacrylamide gel and alginate
gel, rupture results from localized damage, leading to small fracture
energies.

Figure 4 | Composition greatly affects behaviour of the hybrid gel. a,
Stress-strain curves of gels of various weight ratios of acrylamide to (acrylamide plus
alginate), as labelled. Each test was conducted by pulling an unnotched sample
to rupture. b, Elastic moduli calculated from stress-strain curves, plotted
against weight ratio. c, Critical stretch, λc, for notched gels of various weight
ratios, measured by pulling the gels to rupture. d, Fracture energy, Γ, as a
function of weight ratio. Weights of crosslinkers were fixed as described in
Fig. 2 legend. Error bars show standard deviation; sample size n = 4.

That a tough material can be made of brittle constituents is remin- 
iscent of transformation-toughening ceramics, and of composites made 
of ceramic fibres and ceramic matrices. The toughness of the hybrid gel 
can be understood by adapting a model well studied for toughened 
ceramics” and for gels of double networks of covalent crosslinks”*”’. 
When a notched hybrid gel is stretched, the polyacrylamide network 
bridges the crack and stabilizes deformation, enabling the alginate 
network to unzip over a large region of the gel (Supplementary 
Fig. 11). The unzipping of the alginate network, in its turn, reduces 
the stress concentration of the polyacrylamide network ahead of the 
notch. The model highlights the synergy of the two toughening 
mechanisms: crack bridging and background hysteresis. 

The idea that gels can be toughened by mixing weak and strong 
bonds has been exploited in several ways, including hydrophobic asso- 
ciations’’, particle-filled gels”’° and supramolecular chemistry”. The 
fracture energy of the alginate-polyacrylamide hybrid gel, however, is 
much larger than previously reported values'*’””°’* for tough
synthetic gels (100-1,000 J m⁻²), a finding that we attribute to how the
alginate network unzips. Each alginate chain contains a large number
of G blocks, many of which form ionic crosslinks with G blocks on
other chains when enough Ca²⁺ ions are present!. When the hybrid gel
is stretched, the polyacrylamide network remains intact and stabilizes 
the deformation, while the alginate network unzips progressively, with 
closely spaced ionic crosslinks unzipping at a small stretch, followed by 
more and more widely spaced ionic crosslinks unzipping as the stretch 
increases. 

Because of the large magnitude of the fracture energy and the pro- 
nounced blunting of the notches, we ran a large number of experi- 
ments to determine the fracture energy, using three types of specimen, 
as well as changing the size of the specimens (Supplementary Figs 
12-16). The experiments showed that the measured fracture energy 
is independent of the shape and size of the specimens. 

Our data suggest that the fracture energy of hydrogels can be greatly 
increased by combining weak and strong crosslinks. The combination 
of relatively high stiffness, high toughness and recoverability of stiff- 
ness and toughness, along with an easy method of synthesis, make 
these materials ideal candidates for further investigation. Further 
development is needed to relate macroscopically observed mechanical 
behaviour to microscopic parameters. Many types of weak and strong 
molecular integrations can be used, making hybrid gels of various 
kinds a fertile area of research. In many applications, the use of 
hydrogels is often severely limited by their mechanical properties. 
For example, the poor mechanical stability of hydrogels used for cell 
encapsulation often leads to unintended cell release and death”’, and 
low toughness limits the durability of contact lenses*®. Hydrogels of 
superior stiffness, toughness, stretchability and recoverability will 
improve the performance in these applications, and will probably open 
up new areas of application for this class of materials. 

Activation of old carbon by erosion of coastal and
subsea permafrost in Arctic Siberia

The future trajectory of greenhouse gas concentrations depends on
interactions between climate and the biogeosphere’”. Thawing of 
Arctic permafrost could release significant amounts of carbon into 
the atmosphere in this century’. Ancient Ice Complex deposits 
outcropping along the ~7,000-kilometre-long coastline of the 
East Siberian Arctic Shelf (ESAS)**, and associated shallow subsea 
permafrost®”’, are two large pools of permafrost carbon*, yet their 
vulnerabilities towards thawing and decomposition are largely 
unknown’"'. Recent Arctic warming is stronger than has been 
predicted by several degrees, and is particularly pronounced over 
the coastal ESAS region’”’*. There is thus a pressing need to 
improve our understanding of the links between permafrost 
carbon and climate in this relatively inaccessible region. Here we 
show that extensive release of carbon from these Ice Complex 
deposits dominates (57 ± 2 per cent) the sedimentary carbon
budget of the ESAS, the world’s largest continental shelf, over- 
whelming the marine and topsoil terrestrial components. Inverse 
modelling of the dual-carbon isotope composition of organic 
carbon accumulating in ESAS surface sediments, using Monte 
Carlo simulations to account for uncertainties, suggests that 
44 ± 10 teragrams of old carbon is activated annually from Ice
Complex permafrost, an order of magnitude more than has been 
suggested by previous studies’*. We estimate that about two-thirds 
(66 ± 16 per cent) of this old carbon escapes to the atmosphere as
carbon dioxide, with the remainder being re-buried in shelf 
sediments. Thermal collapse and erosion of these carbon-rich 
Pleistocene coastline and seafloor deposits may accelerate with 
Arctic amplification of climate warming”. 

The large magnitude of shallow permafrost carbon pools relative to
the atmospheric pools of carbon dioxide (~760 Pg) and methane
(~3.5 Pg) suggests that carbon release from thawing permafrost has
the potential to affect large-scale carbon cycling. Arctic permafrost can
be divided into three main compartments: terrestrial (tundra and
taiga) permafrost (~1,000 Pg C)*, Ice Complex (coastal and inland)
permafrost (~400 Pg C)** and subsea permafrost (~1,400 Pg C)°”.
Even without considering subsea permafrost, the carbon held in the
top few metres of the pan-arctic permafrost constitutes approximately
half of the global soil organic carbon pool’.

Investigations of Arctic greenhouse gas releases have focused on
terrestrial permafrost systems*®'*, and only recently on subsea
permafrost*”'*'’, with a notable scarcity of studies on the thawing
permafrost outcropping along the Arctic coast. In particular, the
extensive coastline of the Eastern Siberian Sea (ESS) is dominated by
exposed tall bluffs comprising ice-rich, fine-grained Ice Complex
deposits (Fig. 1a). The origin of the ~1-million-km² deposits (with
average depth 25 m) dominating northeastern Siberia (and parts of
Alaska and northwestern Canada) is under some debate, but this
Pleistocene material is quite distinct from peat and mineral soil of
other Arctic permafrost**. These relict soils of the steppe-tundra
ecosystem have high carbon contents (1-5%)*”. The export of organic
carbon from the eroding ESAS Ice Complex is presently estimated at
4 Tg yr⁻¹ (ref. 14), yet it has also been proposed that erosion from the
Lena Delta coastline alone might contribute this amount’*. Clearly,
large uncertainties remain regarding the magnitude of eroded carbon
export from land to the shelf.

Figure 1 | Erosion of Ice Complex deposits on the East Siberian Arctic Shelf.
a, Eroding, carbon-rich Ice Complex coast on Muostakh Island in the
southeastern Laptev Sea. b, Erosion-induced turbidity clouds envelop several
thousand kilometres of East Siberian Sea coastal waters. Note the rounded
shorelines of northeastern Siberia, indicative of coastal erosion. Red dashed line
shows areas of intensive ongoing erosion. (Satellite image of 24 August 2000,
available at http://visibleearth.nasa.gov.)

The extensive coastal exposure of the Ice Complex deposits (ICD)
makes them potentially more vulnerable than other terrestrial
permafrost; ICD retreat rates are 5-7 times higher than those of other
coastal permafrost bodies'*. A destructive thaw-erosion process
brought on by thermal collapse of the coastline promotes surface
subsidence, with ICD loss exacerbated by the increased wave and wind
erosion that accompany sea-level rise and longer ice-free seasons’.
Satellite images show a large erosional turbidity cloud along the 
ESAS coastline (Fig. 1b). From limited land-based surveys, this ICD 
erosion is thought to be delivering as much total organic carbon to the 
ESAS as all its large rivers combined’””*. Unfortunately, these studies 
are limited in spatial coverage, and do not consider the fate of the 
released carbon in the receiving ocean. There are no field-based reports 
of degradation or greenhouse-gas releases of thawing ICD; however, a 
recent investigation of organic matter genesis in ESS surface sediments 
suggests that ICD erosion may dominate over planktonic and riverine 
sources”. Laboratory experiments have shown that microbial degra- 
dation begins once permafrost has thawed, implying survival of viable 
bacteria and an inherent lability of the very old ICD organic carbon 
(ICD-OC)'*"!. In addition to terrestrial ICD, the ESAS sediments 
(inundated by seawater during the early Holocene epoch) also host 
large Pleistocene deposits, presumably containing carbon in quantities 
similar to those in the upper-1-m soil pool®*. These reservoirs are 
subject to active sea-floor thermal erosion'*’’, potentially releasing 
as much organic carbon as coastal erosion and rivers”’. Overall, carbon 
released from thawing and eroding coastal permafrost may play a 
quantitatively important role in the Arctic carbon cycle. 

To evaluate the role of the ICD and subsea permafrost carbon 
(hereafter jointly referred to as ICD-PF) in the contemporary ESAS 
carbon cycle, we adopted an inverse approach based on deducing the 
contribution of this ICD-PF to carbon accumulating on the entire ESAS 
shelf. We analysed more than 200 sediment samples (see Methods 
Summary), collected during ship-based expeditions spanning the ESAS 
(Supplementary Fig. 2, Supplementary Methods). We used a
dual-carbon-isotope (δ¹³C and Δ¹⁴C) mixing model, solved with a Monte
Carlo simulation strategy to account for endmember uncertainties, to 
deconvolve the relative contributions from ICD-PF, plankton detritus 
and a terrestrial/topsoil component. We then combined the fractional 
contribution from ICD-PF with the radiochronologically constrained 
sediment accumulation flux (Methods Summary and Supplementary 
Methods) to derive the shelf-wide re-burial flux of old carbon from 
permafrost. 
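
The inverse approach described above can be illustrated with a three-endmember mass balance solved repeatedly under randomly perturbed endmember values. The sketch below is an assumed, simplified version of such a scheme and not the authors' actual code (which is described in their Supplementary Methods): it draws each endmember from a normal distribution using the means and standard deviations quoted in this letter, solves the two isotope balances plus the closure condition, and keeps only draws whose fractions lie between 0 and 1. The sample composition used in the example is hypothetical.

```python
import random

# Endmember (mean, s.d.) for delta13C and Delta14C in per mil, as quoted in the text.
ENDMEMBERS = {
    "ICD-PF": ((-26.3, 0.67), (-940.0, 84.0)),
    "topsoil-PF": ((-28.2, 1.96), (-126.0, 54.0)),
    "marine": ((-24.0, 3.0), (60.0, 60.0)),
}


def solve_fractions(sample, ends):
    """Solve f1 + f2 + f3 = 1 together with both isotope mass balances."""
    d13, d14 = sample
    (a1, b1), (a2, b2), (a3, b3) = ends
    # Eliminating f3 = 1 - f1 - f2 leaves a 2 x 2 linear system.
    m11, m12, r1 = a1 - a3, a2 - a3, d13 - a3
    m21, m22, r2 = b1 - b3, b2 - b3, d14 - b3
    det = m11 * m22 - m12 * m21
    if abs(det) < 1e-12:
        return None  # endmembers are collinear for this draw
    f1 = (r1 * m22 - m12 * r2) / det
    f2 = (m11 * r2 - r1 * m21) / det
    return f1, f2, 1.0 - f1 - f2


def monte_carlo_fractions(sample, n_draws=20000, seed=0):
    """Mean source fractions over the physically valid random draws."""
    rng = random.Random(seed)
    names = list(ENDMEMBERS)
    kept = []
    for _ in range(n_draws):
        ends = [(rng.gauss(*ENDMEMBERS[n][0]), rng.gauss(*ENDMEMBERS[n][1]))
                for n in names]
        fractions = solve_fractions(sample, ends)
        if fractions is not None and all(0.0 <= f <= 1.0 for f in fractions):
            kept.append(fractions)
    means = [sum(column) / len(kept) for column in zip(*kept)]
    return dict(zip(names, means)), len(kept)


if __name__ == "__main__":
    hypothetical_sample = (-25.98, -539.8)  # (delta13C, Delta14C), per mil
    print(monte_carlo_fractions(hypothetical_sample))
```

Multiplying the ICD-PF fraction obtained for each sediment sample by the local sediment organic carbon accumulation flux would then give the re-burial flux of old carbon described in the text.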

We examined the fate of thawing ICD-OC in ambient conditions on 
coastal slopes of Muostakh, an island in the southeastern Laptev Sea 
that is disappearing as a result of erosion rates of up to 20 m yr⁻¹ (refs
19, 20, 22; Fig. 1a). Bulk carbon contents, and molecular and isotopic
compositions of ICD-OC, were assessed in conjunction with in situ
CO2 evasion fluxes (Supplementary Methods) to assess susceptibility
of the organic carbon to degradation before delivery into coastal 
waters. 

Radiocarbon ages of surface-sediment organic carbon ranged 
between 10,800 and 7,300 '*C yr (Fig. 2a shows A'“C values; see also 
Supplementary Table 1) in the western ESS and the Dmitry Laptev 
Strait, regions dominated by coastal erosion (Fig. 1b). Organic-carbon 
radiocarbon ages were also old in the southern ESS and the Laptev Sea, 
ranging from 7,800 to 3,200 '*C yr. Lateral shelf transport times are 
likely to be much smaller than these measured *C ages”, implying 
significant supply of pre-aged carbon to these sediments. 5'°C values 
varied, from —28.3 to —25.2%v near the coast, to —24.8 to —21.2%o on 
the outer ESAS (Fig. 2b; Supplementary Table 1). In contrast to other 
world-ocean shelf seas, where the sediment organic carbon originates 
from planktonic and riverine sources, coastline and sediment erosion 
represent significant sources of organic carbon to the ESAS. The rela- 
tive contribution of the three sources was deduced from their carbon 
isotope fingerprints. In addition to a marine source, with 
δ¹³C = −24 ± 3.0‰ and Δ¹⁴C = 60 ± 60‰ (mean ± standard deviation 
(s.d.); Supplementary Methods, Supplementary Figs 4, 5), we 
distinguish between two terrestrial sources: ICD-PF organic carbon 
(coastal, inland, and subsea; formed before inundation), with 
δ¹³C = −26.3 ± 0.67‰ and Δ¹⁴C = −940 ± 84‰ (Supplementary 
Fig. 4, Supplementary Table 4), and topsoil permafrost (topsoil-PF) 

Figure 2 | Carbon isotope compositions and contribution of organic carbon 
sources to sediment accumulation on the East Siberian Arctic Shelf. 
a, b, Δ¹⁴C-OC (a) and δ¹³C-OC (b) signals in ESAS surface sediments. 
c, Annual sedimentary organic carbon accumulation fluxes (g OC m⁻² yr⁻¹) 
and relative contributions (pie charts) of the three source pools to the surface- 
sediment organic carbon on the ESAS. The mean ESS contributions are: 
57 ± 1.6% from ICD-PF (grey), 16 ± 3.4% from topsoil-PF (green) and 
26 ± 8.0% from marine/planktonic organic carbon (blue), as identified by 
numerical (Monte Carlo) simulations of the dual-carbon-isotope (δ¹³C and 
Δ¹⁴C) and endmember mixing models. Land area marked in light grey indicates 
the distribution of the Ice Complex. 


organic carbon (drained from vegetation debris and the thin, surficial, 
annual thaw layer of the continuous permafrost regions of northeast 
Siberia), with δ¹³C = −28.2 ± 1.96‰ and Δ¹⁴C = −126 ± 54‰ 
(Supplementary Fig. 4, Supplementary Table 3 and Supplementary 
Methods). The endmember source assignments are based on an 
extensive compilation of circum-arctic literature data, yielding 
statistically robust and distinctive values for the three endmembers, 
as further explained in the Supplementary Information (Supplemen- 
tary Text; Supplementary Figs 4, 5; Supplementary Tables 3, 4). 
Naturally, the isotopic endmember values carry uncertainties, which 
may be reduced in the future by additional observations of the marine 
and topsoil composition. The ¹³C and ¹⁴C compositions of the three 
endmembers are well separated from each other (Supplementary 
Fig. 4), which allows separation of their contributions while properly 
accounting for the associated uncertainties using the Monte Carlo 
simulation approach. We stress that the two terrestrial endmembers 
are solely source-based, and independent of transport or mobilization 
route, meaning that both ICD-PF and topsoil-PF can be delivered by 
coastal, delta and riverbank erosion as well as river transport. The 
resulting isotopic mass-balance model shows contributions of marine 
(planktonic) organic carbon to the shelf sediments ranging between 
7% nearshore and 54% on the outer shelf, whereas topsoil-PF contributes 
~30–35% close to land, decreasing to ~5% farther out (Fig. 2c). 
ICD-PF constitutes 36–76% of the sedimentary organic carbon 
throughout the broad shelf, despite its largely coastal delivery. ICD-OC 
is ballasted by mineral association and rapidly settles, 
whereupon it is probably resuspended from the sea floor and dispersed 
over the shelf, mostly by bottom-boundary-layer transport. Old 
permafrost-released erosional carbon thus dominates burial of organic 
carbon on the ESAS. 
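
The dual-carbon-isotope mass balance described above can be illustrated with a short Python sketch. It is not the authors' code: it assumes that each endmember value is drawn from an independent normal distribution with the means and standard deviations quoted in the text, solves the three-source balance exactly for each draw, and discards draws that give fractions outside 0 to 1. The function name `source_fractions` and the sample values in the usage note are hypothetical.

```python
import numpy as np

# Endmember (mean, s.d.) values in per mil, as quoted in the text.
ENDMEMBERS = {
    "ICD-PF": {"d13C": (-26.3, 0.67), "D14C": (-940.0, 84.0)},
    "topsoil-PF": {"d13C": (-28.2, 1.96), "D14C": (-126.0, 54.0)},
    "marine": {"d13C": (-24.0, 3.0), "D14C": (60.0, 60.0)},
}


def source_fractions(d13c_sample, d14c_sample, n_draws=20000, seed=0):
    """Return {source: (mean, s.d.)} of the fractional contributions.

    Assumption (not from the paper): endmember values are independent
    normal variates, and draws giving fractions outside [0, 1] are dropped.
    """
    rng = np.random.default_rng(seed)
    names = list(ENDMEMBERS)
    d13 = np.column_stack(
        [rng.normal(*ENDMEMBERS[n]["d13C"], size=n_draws) for n in names]
    )
    d14 = np.column_stack(
        [rng.normal(*ENDMEMBERS[n]["D14C"], size=n_draws) for n in names]
    )
    # One 3x3 system per draw: d13C balance, D14C balance, fractions sum to 1.
    a = np.stack([d13, d14, np.ones_like(d13)], axis=1)
    b = np.broadcast_to(
        np.array([d13c_sample, d14c_sample, 1.0])[:, None], (n_draws, 3, 1)
    )
    fractions = np.linalg.solve(a, b)[..., 0]
    keep = np.all((fractions >= 0.0) & (fractions <= 1.0), axis=1)
    kept = fractions[keep]
    return {n: (kept[:, i].mean(), kept[:, i].std()) for i, n in enumerate(names)}
```

For example, `source_fractions(-26.3, -600.0)` evaluates a hypothetical sediment sample. Because the Δ¹⁴C values of the three sources are far apart, the ICD-PF fraction is constrained mainly by the radiocarbon balance, whereas the split between topsoil-PF and marine carbon depends more on δ¹³C and is correspondingly less certain.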

We estimate the net sediment burial of ICD-PF carbon using 
accumulation fluxes from sediment cores (36 ± 17 g OC m⁻² yr⁻¹; 
all confidence intervals are 95%, unless otherwise stated; Fig. 2c, 
Supplementary Table 2). This was scaled up by the fraction of sea floor 
that is available for carbon burial (0.6), corresponding to water depth 
> 30 m (Supplementary Fig. 2), where resuspension is negligible and 
sediments thus accumulate. Combining the ESS shelf area 
(9.87 × 10⁵ km²) with the ICD-PF contribution to the sediment 
organic carbon (ESS only: 57 ± 1.6%; Supplementary Table 5) 
yields an overall annual ICD-PF carbon accumulation flux of 
12 ± 8 Tg C yr⁻¹. Inclusion of the Laptev Sea increases this value to 
20 ± 8 Tg C yr⁻¹ (Supplementary Table 6). Hence, this approach 
reveals that the supply of carbon from ICD-PF erosion to the ESAS 
is much larger than has previously been assumed. 
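
The upscaling for the ESS can be checked with a few lines of Python. This is only a sketch of the central estimate, using the four values quoted in the paragraph above; the function name is hypothetical and the uncertainty propagation behind the ± 8 Tg C yr⁻¹ is not reproduced.

```python
def icd_burial_flux_tg(accumulation_g_m2_yr, depositional_fraction,
                       area_km2, icd_fraction):
    """Annual ICD-PF carbon burial in Tg C per year (central estimate only)."""
    area_m2 = area_km2 * 1.0e6  # 1 km^2 = 1e6 m^2
    grams_per_year = (accumulation_g_m2_yr * depositional_fraction
                      * area_m2 * icd_fraction)
    return grams_per_year / 1.0e12  # 1 Tg = 1e12 g


# Values quoted in the text for the East Siberian Sea (ESS).
ess_flux = icd_burial_flux_tg(36.0, 0.6, 9.87e5, 0.57)
print(f"{ess_flux:.1f} Tg C per year")
```

The product comes to about 12 Tg C yr⁻¹, in agreement with the ESS value given above.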

The biogeochemical composition of the eroding slopes of Muostakh 
Island (Fig. 3) indicates extensive organic matter degradation of the 
thawing ICD before delivery to the ocean. Recurring trends were 
observed in several properties between higher and lower elevations 
on the investigated slopes that are consistent with continuing degra- 
dation (Fig. 3; Supplementary Tables 7, 8), specifically: decreasing soil 
organic carbon content; increasing δ¹³C of organic carbon (δ¹³C_OC); 
decreasing Δ¹⁴C_OC; decreasing ratio of high-molecular-weight 
n-alkanoic acids to high-molecular-weight n-alkanes; increasing ratio 
of even, low-molecular-weight to odd, high-molecular-weight 
n-alkanes; and increase in atmospheric CO₂ venting, deduced from 
field-chamber soil respiration measurements (Supplementary Methods). 
These trends and fluxes contrast with prior assumptions that all thawed 
and erosion-mobilized ICD-OC is directly flushed into the sea without 
sub-aerial degradation. The elemental, isotopic and molecular 
data imply 66 ± 16% (mean ± s.d.; Supplementary Methods) downslope 
degradative loss of ICD-OC. 

Figure 3 | Biogeochemical signals of Ice Complex organic matter 
degradation on Muostakh Island. a, Study area. b, Distribution of CO₂ 
outgassing. c–g, Distributions along the four studied slopes (positions indicated 
in b) of soil organic carbon content (c); δ¹³C-OC signal (d); Δ¹⁴C-OC signal 

Combining the 20 ± 8 Tg C yr⁻¹ sediment re-burial flux of thawed 
old organic carbon with a recent estimate of water-column degradation 
of terrestrially derived particulate organic carbon on the ESAS of 
1.4 yr⁻¹ (2.5 ± 1.6 Tg C yr⁻¹; mean ± s.d.) suggests an ICD-PF 
organic carbon flux to the marine system of 22 ± 8 Tg C yr⁻¹ 
(Supplementary Fig. 1). Assuming an equal contribution of this flux 
from coastline and subsea erosion (Supplementary Table 6, which also 
includes 25/75% and 75/25% models), the 66 ± 16% carbon loss along 
the eroding coastal slopes corresponds to a carbon venting (presumably 
mostly CO₂) from the ICD of 22 ± 8 Tg yr⁻¹ (Supplementary Fig. 1). 
The total remobilization of old organic carbon from thawing of ICD-PF 
is thus ~44 ± 10 Tg C yr⁻¹ (Supplementary Table 6; Supplementary 
Fig. 1). 
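
The budget in this paragraph can be followed step by step in Python. The sketch below is one reading of the calculation that reproduces the quoted central values; it is not taken from the Supplementary Information. It assumes a water-column degradation term of about 2.5 Tg C yr⁻¹, an equal coastal and subsea split, and that the 66% loss applies to carbon thawed on the coastal slopes before it reaches the sea. The function name `icd_budget` is hypothetical.

```python
def icd_budget(reburial, water_column_loss, coastal_share, slope_loss):
    """Return (flux to sea, carbon vented on slopes, total remobilized), Tg C/yr."""
    to_sea = reburial + water_column_loss
    coastal_to_sea = coastal_share * to_sea
    # If a fraction `slope_loss` is degraded on the slopes, only
    # (1 - slope_loss) of the thawed coastal carbon reaches the sea.
    thawed_on_slopes = coastal_to_sea / (1.0 - slope_loss)
    vented = thawed_on_slopes - coastal_to_sea
    return to_sea, vented, to_sea + vented


to_sea, vented, total = icd_budget(20.0, 2.5, 0.5, 0.66)
print(f"to sea {to_sea:.1f}, vented {vented:.1f}, total {total:.1f} Tg C per year")
```

Under these assumptions the three terms come to roughly 22, 22 and 44 Tg C yr⁻¹. Changing `coastal_share` to 0.25 or 0.75 corresponds to the alternative models mentioned for Supplementary Table 6.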

The present assessment suggests a substantially larger flux of carbon 
from thawing ICD permafrost (44 ± 10 Tg C yr⁻¹; Supplementary 
Table 6) than has been inferred previously from exclusively land-based 
surveys (~4 Tg C yr⁻¹; no error reported). Previous estimates of ICD 
erosion may have been too low for several reasons, including gross 
upscaling from limited point measurements of ICD retreat rates. 
In addition, upscaling using digital shoreline length data leads to 
considerable underestimations; and potentially large inputs from 
retrogressive thaw slumps and slope failure are excluded when elevation 
change data are not included in coastline retreat measurements. 
Finally, bottom erosion is a previously neglected but potentially 
important contributor of old eroded organic carbon to the modern 
biogeochemical cycle on the ESAS, with erosion rates of 10–30 cm yr⁻¹ 
(refs 18,29) at depths less than 30 m (nearly half the ESAS), where 
present-day bottom-water temperatures in summer are 2–3 °C and 
have risen during the past decade. Thermal collapse of the carbon-rich, 
permafrost-laden coastlines and sea floors may accelerate with 
Arctic amplification of climate warming, and could further intensify 
the role of old Ice Complex organic carbon in carbon cycling in the 
world’s largest shelf sea. 

Figure 3 (continued) | (e); ratio of high-molecular-weight n-alkanoic acids to high-molecular-weight 
n-alkanes (proxy for degradation status) (f) and ratio of even, low-molecular- 
weight n-alkanes to odd, high-molecular-weight n-alkanes (proxy for bacterial 
biomass relative to substrate) (g). Ratios in f and g are molecular ratios. 


METHODS SUMMARY 


Surface sediments were collected on several expeditions on the ESAS in 2004, 2005, 
2007 and 2008 (Supplementary Fig. 2, Supplementary Tables 1 and 9). The 
samples were analysed for organic carbon content and δ¹³C (UC Davis Stable 
Isotope Facility, USA) and Δ¹⁴C (US National Ocean Sciences Accelerator Mass 
Spectrometry (NOSAMS) Facility of the Woods Hole Oceanographic Institution, 
USA). The relative contributions of three endmember sources to the surface 
sediment organic carbon content were quantified using a dual-carbon-isotope 
mixing model, solved with a Monte Carlo simulation approach (Supplementary 
Table 3). The endmembers were coastal Ice Complex permafrost (ICD-PF: 
δ¹³C = −26.3 ± 0.67‰; Δ¹⁴C = −940 ± 84‰; Supplementary Table 4); topsoil 
permafrost (topsoil-PF: δ¹³C = −28.2 ± 1.96‰; Δ¹⁴C = −126 ± 54‰; 
Supplementary Table 3); and marine organic carbon (δ¹³C = −24 ± 3.0‰; 
Δ¹⁴C = 60 ± 60‰; Supplementary Figs 4, 5). 
Radiochronological measurements on sediment cores from the ESAS were 
performed at Stockholm University and at the Radiation Research Division of the 
Risø National Laboratory for Sustainable Energy, Denmark (Supplementary Table 
10, Supplementary Fig. 3). Total inventories of excess ²¹⁰Pb were used to calculate 
the annual sediment organic carbon accumulation on the ESAS (Supplementary 
Table 2). The average contribution of organic carbon from ICD-PF in the surface 
sediment was then used to infer the annual sediment organic carbon accumulation 
from ICD-PF to the ESAS. 

Ice Complex samples from the slopes of Muostakh Island were collected in July 
2006 (Fig. 3, Supplementary Table 7). Bulk organic carbon and δ¹³C analyses were 
performed at Stockholm University (Department of Geological Sciences) and 
Δ¹⁴C analyses at NOSAMS. The soil samples were extracted and separated for 
identification of molecular biomarkers using gas chromatography/mass spectro- 
metry. In addition, soil respiration measurements were collected on Muostakh 
Island slopes with automatic lid chambers equipped with infrared gas analysers 
(Fig. 3; Supplementary Table 8). Full details of methods are available in 
Supplementary Methods. 

Recent Antarctic Peninsula warming relative to 
Holocene climate and ice-shelf history 

Robert Mulvaney, Nerilie J. Abram, Richard C. A. Hindmarsh, Carol Arrowsmith, Louise Fleet, Jack Triest, Louise C. Sime, 
Olivier Alemany & Susan Foord 


Rapid warming over the past 50 years on the Antarctic Peninsula 
is associated with the collapse of a number of ice shelves and 
accelerating glacier mass loss. In contrast, warming has been 
comparatively modest over West Antarctica and significant 
changes have not been observed over most of East Antarctica, 
suggesting that the ice-core palaeoclimate records available from 
these areas may not be representative of the climate history of the 
Antarctic Peninsula. Here we show that the Antarctic Peninsula 
experienced an early-Holocene warm period followed by stable 
temperatures, from about 9,200 to 2,500 years ago, that were sim- 
ilar to modern-day levels. Our temperature estimates are based on 
an ice-core record of deuterium variations from James Ross Island, 
off the northeastern tip of the Antarctic Peninsula. We find that the 
late-Holocene development of ice shelves near James Ross Island 
was coincident with pronounced cooling from 2,500 to 600 years 
ago. This cooling was part of a millennial-scale climate excursion 
with opposing anomalies on the eastern and western sides of the 
Antarctic Peninsula. Although warming of the northeastern 
Antarctic Peninsula began around 600 years ago, the high rate of 
warming over the past century is unusual (but not unprecedented) 
in the context of natural climate variability over the past two 
millennia. The connection shown here between past temperature 
and ice-shelf stability suggests that warming for several centuries 
rendered ice shelves on the northeastern Antarctic Peninsula 
vulnerable to collapse. Continued warming to temperatures that 
now exceed the stable conditions of most of the Holocene epoch is 
likely to cause ice-shelf instability to encroach farther southward 
along the Antarctic Peninsula. 

The Antarctic Peninsula is at present one of the most rapidly warming 
regions on Earth (Fig. 1a). Historical observations since 1958 at 
Esperanza Station (Fig. 1b) document warming equivalent to 
3.5 ± 0.8 °C per century. During this time, a series of ice shelves 
stretching from Prince Gustav Channel to the Larsen B ice shelf on 
the northeastern Antarctic Peninsula have been lost, causing an 
acceleration of the feeder glaciers that drain ice from the Antarctic 
Peninsula. To assess these recent rapid changes, a longer-term 
perspective on Antarctic Peninsula climate and the role of past 
atmospheric temperature in determining ice-shelf stability is urgently 
needed. To address this, we drilled an ice core to the bed of the ice 
cap on James Ross Island (JRI). This site lies off the northeastern tip of 
the Antarctic Peninsula, adjacent to the area that has witnessed a series 
of ice-shelf collapses since 1995 (Fig. 1b). 

The 363.9-m-long JRI ice core provides a temperature reconstruction, 
based on deuterium/hydrogen isotope ratios of the ice (δD), that 
spans the entire Holocene and extends into the last glacial interval 
(Fig. 2, Methods Summary and Supplementary Fig. 1). Evidence of 
the glacial age ice is found in the final 5 m of the JRI ice core; 
initial estimates suggest the record may extend to ~50,000 yr BP (by 
convention, 0 yr BP means AD 1950), although an unrealistically rapid 
isotopic transition implies that an unconformity may be present in the 
early deglacial interval of the ice core. Taking into account changes in 
ocean isotopic values, the isotopic composition of the glacial ice on 
JRI is equivalent to temperatures that were approximately 6.1 ± 1.0 °C 
cooler than present (where by present we mean AD 1961–1990) during 
the Last Glacial Maximum (LGM). By comparison, the LGM is 
found to have been 7.4 °C cooler in Dronning Maud Land and 9.3 °C 
cooler at Dome C on the East Antarctic plateau. 

Figure 1 | Regional and climatic setting of the Antarctic Peninsula. 
a, Temperature trends for the 50 years from 1958 to 2008 show the rapid 
regional warming of the Antarctic Peninsula. Trends are shown for January–December 
annual averages of gridded land and ocean surface temperature 
data. b, James Ross Island (JRI) is located near the northeastern tip of the 
Antarctic Peninsula, within the zone of rapid regional warming, and adjacent to 
the former Prince Gustav, Larsen A and Larsen B ice shelves. 

Figure 2 | Isotope and depth-age profiles of the JRI ice core. The δD isotope 
profile for the JRI ice core is shown in terms of the 100-yr average (black) for the 
whole of the Holocene and the 10-yr average (grey) for 4,000 yr BP to present. 
The JRI depth-age scale, JRI-1 (blue), was constructed on the basis of a 
glaciological flow model for this site (red) with adjustment derived from fixed 
time markers (black diamonds; horizontal error bars give estimated age 
uncertainty for the fixed markers). Further details are provided in Methods and 
Supplementary Table 1. 

The reduced magnitude of LGM-Holocene temperature change on 
the Antarctic Peninsula probably reflects its more northerly position 
and proximal maritime influence. An alternative explanation could 
be that the JRI ice cap experienced changes in elevation at the LGM, 
making this site seem isotopically warmer than continental Antarctica. 
However, this interpretation would require that the JRI ice cap at the 
LGM was ~150–360 m lower than present, according to Dronning 
Maud Land and Dome C temperatures. Such a reduction is inconsistent 
with glaciological evidence that the JRI ice cap had a confluence 
with the Antarctic Peninsula ice sheet in the Prince Gustav 
Channel until the early Holocene. The JRI ice core thus adds to the 
glaciological history of the northern Antarctic Peninsula, with the 
reduced LGM-Holocene isotope contrast implying that the ice cap 
cannot have thickened significantly at the LGM and was not overrun 
by isotopically colder ice from the south. 

The Holocene temperature history from the JRI ice core is characterized 
by an early-Holocene climatic optimum that was 1.3 ± 0.3 °C 
warmer than present (Fig. 3). The magnitude and progression of this 
early-Holocene optimum are similar to those observed in ice-core records 
from the main Antarctic continent. A marine sediment record from 
off the shore of the western Antarctic Peninsula also shows an 
early-Holocene optimum during which surface ocean temperatures were 
determined to be ~3.5 °C higher than present. Other evidence 
suggests that the George VI ice shelf on the southwestern Antarctic 
Peninsula was absent during this early-Holocene warm interval but 
reformed in the mid Holocene. 

Following this widespread early-Holocene climate optimum, temperature 
on the Antarctic Peninsula decreased and the JRI ice core 
documents a long interval of stable climate that persisted from ~9,200 
to 2,500 yr BP (Fig. 3). During this interval, the mean temperature 
anomaly, of 0.2 ± 0.2 °C, indicates that conditions at JRI were 
comparable to the warm conditions observed at this site over recent 
decades. Likewise, marine temperatures on the western side of the 
Antarctic Peninsula declined to reach, by ~8,000 yr BP, a long-term 
mean that was close to present-day values. Within this interval of 
mid-Holocene stability, the JRI isotope record indicates that from ~5,000 
to 3,000 yr BP conditions may have been only marginally warmer 
than present. Various proxy evidence exists for a mid-Holocene warm 
period on the Antarctic Peninsula, although the lack of a consensus on 
its timing in this region may be explained by the small magnitude of 
this feature in the JRI temperature record compared with the 
well-defined mid-Holocene climate optimum in continental Antarctic 
ice-core records. 

Figure 3 | Holocene temperature history of the Antarctic Peninsula. The JRI 
ice-core temperature reconstruction relative to the 1961–1990 mean (black 
trace, 100-yr average; the grey band indicates the standard error of the 
calibration dependence) is shown alongside a sea surface temperature (SST) 
reconstruction from off the shore of the western Antarctic Peninsula (blue 
curve), and temperature reconstructions from the Dome C (red) and 
Dronning Maud Land (green) ice cores from East Antarctica. Horizontal bars 
show intervals in the Holocene when marine sediment cores indicate that open 
water was present in the area of the Prince Gustav (black; top to bottom are 
north to south core sites; original ¹⁴C ages have been calibrated) and Larsen A 
(grey) ice shelves, which collapsed in AD 1995. 

The Holocene ice-shelf history along the eastern Antarctic 
Peninsula shows a strong connection to Antarctic Peninsula temperatures. 
Following the deglacial transition from grounded to floating ice 
in Prince Gustav Channel at ~10,000 to 8,000 yr BP, this area experienced 
intervals of seasonally open water through to ~1,500 yr BP. 
Marine sediments indicate that a permanent ice shelf was established 
there only after ~1,500 yr BP and that the maximum ice-shelf extent 
may have been reached as recently as a few centuries ago. Farther 
south, there is evidence for instability of the Larsen A ice shelf between 
3,800 and 1,400 yr BP. Farther south again, the Larsen B ice shelf 
probably remained intact throughout the Holocene, although there 
is evidence that the ice shelf was progressively weakened by melting. 
Combining the JRI temperature reconstruction with the marine sedi- 
ment evidence shows that temperatures similar to present occurred in 
this region for much of the Holocene, resulting in a regime in which ice 
shelves were only transient features along the northern-most part of 
the eastern Antarctic Peninsula and were undergoing decay farther to 
the south. An additional new perspective is that recent warming to 
levels consistent with the mid Holocene meant that the ice shelves 
along the northeastern Peninsula were poised for the succession of 
collapses observed there over recent decades. 

The late-Holocene development of ice shelves fed from the 
northeastern Antarctic Peninsula seems to be related to millennial-scale 
climate variability in the region (Figs 3 and 4a). After 2,500 yr BP, 
the JRI isotope record documents pronounced cooling to temperatures 
that were on average 0.7 ± 0.3 °C cooler than present between 800 and 
400 yr BP (AD 1150–1550), and on a decadal timescale temperatures 
may have at times been more than 1.8 ± 0.3 °C cooler than present. 
Late-Holocene cooling has also been inferred from northeastern 
Antarctic Peninsula lake records. The prominent millennial-scale 
cooling at JRI is matched by a similarly prominent but warm excursion 
in marine temperatures to the west of the Antarctic Peninsula. On 
the central spine of the Antarctic Peninsula, a 500-yr-long ice-core 
record from the Dyer Plateau shows that temperatures here were 
approximately the same as present at 450 yr BP, suggesting an east-west 
divide across the Antarctic Peninsula in this late-Holocene 
climate oscillation. Thus, although glacial-scale climate changes have 
been consistent across the whole of the Antarctic Peninsula region, 
millennial-scale climate variability was particularly strong during the 
late Holocene and seems to have been characterized by opposing east-west 
temperature anomalies across the Antarctic Peninsula. 

Figure 4 | Two-thousand-year climate history of the Antarctic Peninsula. 
a, The JRI temperature reconstruction (black trace, 100-yr average; grey trace, 
10-yr average; relative to 1961–1990 mean) is shown alongside the SST record 
from Ocean Drilling Program site 1098 to the west of the Antarctic Peninsula 
(blue curve) and the reconstructed Northern Hemisphere temperature 
anomaly (dark green curve, relative to 1961–1990 mean; the light green 
envelope indicates the 95% confidence interval). Whereas SST to the west of the 
Antarctic Peninsula shows similarities to Northern Hemisphere climate over 
the past 2,000 yr, the JRI record shows an opposing temperature excursion 
which demonstrates that the Antarctic Peninsula did not experience a 
widespread Medieval Warm Period/Little Ice Age sequence comparable to 
Northern Hemisphere climate at that time. Warming at JRI has been ongoing 
for several centuries, although the warming by 1.56 °C over the past 100 yr (red 
lines in a and b) is highly unusual in the context of natural variability. b, This is 
shown by a histogram analysis of temperature trends calculated in moving 100-yr 
windows of annual-resolution data from the JRI ice core starting at 
2,000 yr BP. 

Opposing temperature anomalies on either side of the Antarctic 
Peninsula are a feature of the Antarctic dipole, which is an interannual 
standing-wave pattern that results in opposite temperature and sea ice 
anomalies between the Weddell Sea and the Amundsen and 
Bellingshausen seas. The observation of similar opposing climate 
oscillations on a millennial scale provides an indication that the 
Antarctic dipole may also influence long-term climate changes in 
the Antarctic Peninsula region. Deducing the exact mechanisms that 
have driven this late-Holocene Antarctic-dipole-like pattern will 
require additional, well-dated palaeoclimate reconstructions to map 
the spatial extent of the climate anomalies. We note, however, that the 
development of this Antarctic-dipole-like feature during the late 
Holocene coincides with the well-documented maximum in El Niño 
activity (Supplementary Fig. 2), which has a role in driving present-day 
variability of the Antarctic dipole. Antarctic-dipole-like cooling of 
the Weddell Sea in the late Holocene, and the propagation of these 
ocean temperature and sea ice anomalies along the eastern Antarctic 
Peninsula by the Weddell gyre, may have also aided the rapid 
establishment of ice shelves in this region during the late Holocene. 

Sustained warming at JRI began ~600 yr ago (Fig. 4a). Lake sediments 
from Beak Island in Prince Gustav Channel also indicate warming 
beginning at ~AD 1410, and together these records demonstrate 
the absence of a widespread Little Ice Age signal on the Antarctic 
Peninsula that was comparable to Northern Hemisphere climate 
(Fig. 4a). The overall rate of pre-anthropogenic temperature increase 
at JRI from AD 1400 to AD 1850 equates to 0.22 ± 0.06 °C per century. 
However, there are times in this interval when warming occurred 
much faster. Using annual-resolution data, trends were calculated 
for the JRI temperature record since 2,000 yr BP over moving 100-yr 
intervals stepped in 1-year increments (yielding 1,958 100-year analysis 
windows) (Fig. 4b). This analysis indicates that rapid warming 
trends exceeding 1.5 °C per century occurred at JRI during the intervals 
spanning AD 1518–1621 and AD 1671–1777, and that trends exceeding 
1.25 °C per century occurred during the interval AD 296–415. 

Over the past 100 yr, the JRI ice-core record shows that the mean 
temperature there has increased by 1.56 ± 0.42 °C (Fig. 4a). This ranks 
as one of the fastest (upper 0.3%) warming trends at JRI since 
2,000 yr BP, according to the moving 100-yr analysis windows, demon- 
strating that rapid recent warming of the Antarctic Peninsula is highly 
unusual although not outside the bounds of natural variability in the 
pre-anthropogenic era (Fig. 4b). The JRI ice core shows that the recent 
phase of warming on the northern Antarctic Peninsula began in the mid 
1920s and that over the past 50 yr the temperature has risen at a rate 
equivalent to 2.6 ± 1.2 °C per century. Repeating the temperature trend 
analysis using 50-yr windows confirms the finding that the rapidity of 
recent Antarctic Peninsula warming is unusual but not unprecedented. 
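
The moving-window trend analysis can be sketched in Python as follows. This is an illustration rather than the authors' code: it assumes ordinary least-squares trends fitted to each window of annual values, and the function names are hypothetical. No JRI data are included, so the usage note refers only to a synthetic series.

```python
import numpy as np


def moving_trends(temps, window=100):
    """Least-squares trend (degC per century) in every window of `window` years."""
    temps = np.asarray(temps, dtype=float)
    t = np.arange(window) - (window - 1) / 2.0  # centred time axis, in years
    windows = np.lib.stride_tricks.sliding_window_view(temps, window)
    # With a centred time axis the OLS slope reduces to sum(t*y) / sum(t^2).
    slopes_per_year = (windows * t).sum(axis=1) / (t ** 2).sum()
    return 100.0 * slopes_per_year


def fraction_at_least(trends, value):
    """Share of windows whose trend is at least `value`."""
    return float(np.mean(np.asarray(trends) >= value))
```

An annual series of 2,057 values gives 2,057 − 100 + 1 = 1,958 windows, which matches the count quoted in the text, although the exact length of the series used by the authors is an assumption here. Calling `fraction_at_least(trends, 1.56)` on such a set of trends gives the kind of ranking reported for the most recent century.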

The long-term climate history provided by the JRI ice core shows 
that natural millennial-scale climate variability has resulted in warm- 
ing on the eastern Antarctic Peninsula that has been ongoing for a 
number of centuries and had left ice shelves in this area vulnerable to 
collapse during the recent phase of rapid warming. If warming con- 
tinues in this region, as is suggested by its attribution in part to rising 
atmospheric greenhouse gas concentrations, then temperatures will 
soon exceed the stable conditions that persisted in the eastern 
Antarctic Peninsula for most of the Holocene. The association between 
atmospheric temperature and ice-shelf stability in the past demon- 
strates that as warming continues ice-shelf vulnerability is likely to 
progress farther southwards along the Antarctic Peninsula coast to 
affect ice shelves that have been stable throughout the Holocene, 
and may make them particularly susceptible to changes in oceanographic 
forcing. 

METHODS SUMMARY 

The JRI ice core discussed in this study was drilled in January–February 2008 at a 
site (57° 41.10′ W, 64° 12.10′ S, 1,542-m elevation; Fig. 1) near the summit of 
Mount Haddington. The ice core was recovered to bedrock at a depth of 
363.9 m. The mean annual temperature at this site is −14.4 °C, and the mean 
annual snow accumulation is 0.63 m water equivalent. The Holocene age scale 
for the JRI ice core, termed JRI-1, is based on a glaciological flow model with 
additional age control provided by fixed time markers derived from local and 
global volcanic events (Fig. 2 and Supplementary Table 1). The temperature 
reconstruction was based on deuterium isotope (δD) measurements (expressed 
relative to the international standard Vienna Standard Mean Ocean Water 
(VSMOW)) along the length of the ice core, with a typical precision of 1.0‰. 
Temperature anomalies were calculated using a δD–temperature dependence of 
6.4 ± 1.3‰ °C⁻¹ (ref. 12), under the assumption that the modern-day calibration 
holds over the entire record, and are given with reference to AD 1961–1990. 
Consistent palaeotemperature results are produced using the oxygen isotopic ratio 
(δ¹⁸O) of the ice (Supplementary Fig. 1), confirming that the isotopic record 
primarily reflects changes in temperature at the JRI site during the Holocene. 
Uncertainties in mean temperature anomalies are the combined standard error 
of the calibration dependence and standard deviation of the variability of 
100-yr-binned data. 
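
The conversion from δD to a temperature anomaly can be written out in a few lines of Python. The sketch uses the 6.4 ± 1.3‰ °C⁻¹ dependence quoted above. Combining the calibration error and the binned variability in quadrature is an assumption made here for illustration, because the text says only that the two are combined. The function names and the δD values in the usage note are hypothetical.

```python
import math

CALIBRATION_SLOPE = 6.4     # per mil per degC, from the text (ref. 12)
CALIBRATION_SLOPE_SE = 1.3  # standard error of the slope, per mil per degC


def temperature_anomaly(delta_d, delta_d_reference, slope=CALIBRATION_SLOPE):
    """Temperature anomaly (degC) relative to the reference-period mean dD."""
    return (delta_d - delta_d_reference) / slope


def anomaly_uncertainty(anomaly, binned_sd, slope=CALIBRATION_SLOPE,
                        slope_se=CALIBRATION_SLOPE_SE):
    """Assumed quadrature sum of the calibration error and the binned s.d."""
    calibration_term = abs(anomaly) * slope_se / slope
    return math.hypot(calibration_term, binned_sd)
```

For instance, a bin whose mean δD is 6.4‰ higher than the 1961–1990 reference, as in `temperature_anomaly(-150.0, -156.4)`, corresponds to an anomaly of about +1 °C, with a calibration contribution to its uncertainty of about 0.2 °C.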


Full Methods and any associated references are available in the online version of 
the paper. 


Received 11 November 2011; accepted 29 June 2012. 
Published online 22 August 2012. 


1. Vaughan, D. G. et al. Recent rapid regional climate warming on the Antarctic 
Peninsula. Clim. Change 60, 243-274 (2003). 

2. Pudsey, C. J. & Evans, J. First survey of Antarctic sub-ice shelf sediments reveals 
mid-Holocene ice shelf retreat. Geology 29, 787-790 (2001). 

3. Pudsey, C. J., Murray, J. W., Appleby, P. & Evans, J. Ice shelf history from 
petrographic and foraminiferal evidence, Northeast Antarctic Peninsula. Quat. Sci. 
Rev. 25, 2357-2379 (2006). 

4. Domack, E. et al. Stability of the Larsen B ice shelf on the Antarctic Peninsula during 
the Holocene epoch. Nature 436, 681-685 (2005). 

5. Brachfeld, S. et al. Holocene history of the Larsen-A Ice Shelf constrained by 
geomagnetic paleointensity dating. Geology 31, 749-752 (2003). 

6. Cook, A. J., Fox, A. J., Vaughan, D. G. & Ferrigno, J. G. Retreating glacier fronts on the 
Antarctic Peninsula over the past half-century. Science 308, 541-544 (2005). 

7. Bentley, M. J. et al. Mechanisms of Holocene palaeoenvironmental change in the 
Antarctic Peninsula region. Holocene 19, 51-69 (2009). 

8. Turner, J. et al. Antarctic climate change during the last 50 years. Int. J. Climatol. 25, 
279-294 (2005). 

9. Steig, E. J. et al. Warming of the Antarctic ice-sheet surface since the 1957 
International Geophysical Year. Nature 457, 459-462 (2009). 

10. Jouzel, J. et al. Magnitude of isotope/temperature scaling for interpretation of 
central Antarctic ice cores. J. Geophys. Res. 108, 4361 (2003). 

11. Bintanja, R., van de Wal, R. S. W. & Oerlemans, J. Modelled atmospheric 
temperatures and global sea levels over the past million years. Nature 437, 
125-128 (2005). 

12. Abram, N. J., Mulvaney, R. & Arrowsmith, C. Environmental signals in a highly 
resolved ice core from James Ross Island, Antarctica. J. Geophys. Res. 116, 
D20116 (2011). 

13. Masson-Delmotte, V. et al. A review of Antarctic surface snow isotopic composition: 
observations, atmospheric circulation, and isotopic modeling. J. Clim. 21, 
3359-3387 (2008). 

14. Stenni, B. et al. The deuterium excess records of EPICA Dome C and Dronning 
Maud Land ice cores (East Antarctica). Quat. Sci. Rev. 29, 146-159 (2010). 


144 | NATURE | VOL 489 | 6 SEPTEMBER 2012 


15. Johnson, J. S., Bentley, M. J., Roberts, S. J., Binnie, S. A. & Freeman, S. P. H. T. 
Holocene deglacial history of the northeast Antarctic Peninsula: a review and new 
chronological constraints. Quat. Sci. Rev. 30, 3791-3802 (2011). 

16. Masson-Delmotte, V. et al. A comparison of the present and last interglacial periods 
in six Antarctic ice cores. Clim. Past 7, 397-423 (2011). 

17. Shevenell, A. E., Ingalls, A. E., Domack, E. W. & Kelly, C. Holocene Southern Ocean 
surface temperature variability west of the Antarctic Peninsula. Nature 470, 
250-254 (2011). 

18. Sterken, M. et al. Holocene glacial and climate history of Prince Gustav Channel, 
northeastern Antarctic Peninsula. Quat. Sci. Rev. 31, 93-111 (2012). 

19. Hall, B. L., Koffman, T. & Denton, G. H. Reduced ice extent on the western Antarctic 
Peninsula at 700-970 cal. yr BP. Geology 38, 635-638 (2010). 

20. Thompson, L. G. et al. Climate since 1520 AD on Dyer Plateau, Antarctic Peninsula: 
evidence for recent climate change. Ann. Glaciol. 20, 420-426 (1994). 

21. Yuan, X. J. ENSO-related impacts on Antarctic sea ice: a synthesis of phenomenon 
and mechanisms. Antarct. Sci. 16, 415-425 (2004). 

22. Mann, M. E. et al. Proxy-based reconstructions of hemispheric and global surface 
temperature variations over the past two millennia. Proc. Natl Acad. Sci. USA 105, 
13252-13257 (2008). 

23. Bracegirdle, T. J., Connolley, W. M. & Turner, J. Antarctic climate change over the 
twenty first century. J. Geophys. Res. 113, D03103 (2008). 

24. Hodgson, D. A. First synchronous retreat of ice shelves marks a new phase of polar 
deglaciation. Proc. Natl Acad. Sci. USA 108, 18859-18860 (2011). 

25. Aristarain, A. J., Delmas, R. J. & Stievenard, M. Ice-core study of the link between 
sea-salt aerosol, sea-ice cover and climate in the Antarctic Peninsula area. Clim. 
Change 67, 63-86 (2004). 

26. Sime, L. C., Tindall, J. C., Wolff, E. W., Connolley, W. M. & Valdes, P. J. Antarctic 
isotopic thermometer during a CO2 forced warming event. J. Geophys. Res. 113, 
D24119 (2008). 

27. Hansen, J., Ruedy, R., Sato, M. & Lo, K. Global surface temperature change. Rev. 
Geophys. 48, RG4004 (2010). 

28. Smith, T. M., Reynolds, R. W., Peterson, T. C. & Lawrimore, J. Improvements to 
NOAA's historical merged land-ocean surface temperature analysis (1880-2006). 
J. Clim. 21, 2283-2296 (2008). 

29. EPICA Community Members. Eight glacial cycles from an Antarctic ice core. Nature 
429, 623-628 (2004). 

30. EPICA Community Members. One-to-one coupling of glacial climate variability in 
Greenland and Antarctica. Nature 444, 195-198 (2006). 


Supplementary Information is available in the online version of the paper. 


Acknowledgements We thank our colleague in the field, S. Shelley, who took part in the 
ice-core drilling project; the captain and crew of HMS Endurance, who provided 
logistical support for the drilling field season; S. Kipfstuhl and the Alfred Wegener 
Institute at Bremerhaven for assistance in the processing of the ice core; J. Smellie and 
S. Roberts for discussions on Antarctic Peninsula tephras; D. Hodgson and E. Wolff for 
comments during preparation of the manuscript; and E. Capron, N. Lang, J. Levine and 
E. Ludlow for laboratory assistance. This study is part of the British Antarctic Survey 
Polar Science for Planet Earth Programme and was funded by the Natural Environment 
Research Council. Support from the Institut Polaire Français - Paul Emile Victor (IPEV), 
and from the Institut National des Sciences de l’Univers in France (INSU/PNEDC 
“AMANCAY” project), facilitated by J. Chappellaz and F. Vimeux, enabled the technical 
contribution of the French National Center for Drilling and Coring (INSU/C2FN). 


Author Contributions R.M. designed the project. R.M., N.J.A. and R.C.A.H. constructed 
the age scale, and R.M., N.J.A., C.A., L.F. and J.T. performed the isotopic, chemical and 
physical measurements to characterize the ice. R.M., N.J.A., J.T., L.C.S., O.A. and S.F. were 
involved with the logistics and fieldwork that enabled the ice-core drilling. R.M. and 
N.J.A. co-wrote the manuscript. 


Author Information Reprints and permissions information is available at 
www.nature.com/reprints. The authors declare no competing financial interests. 
Readers are welcome to comment on the online version of the paper. Correspondence 
and requests for materials should be addressed to R.M. (rmu@bas.ac.uk). 


©2012 Macmillan Publishers Limited. All rights reserved 


METHODS 

Site details. The JRI ice core presented in this study was drilled in January-February 
2008 at a site (57° 41.10’ W, 64° 12.10’ S, 1,542-m elevation; Fig. 1) near the summit 
of Mount Haddington. The ice core was recovered to bedrock at a depth of 363.9 m 
using an electromechanical drill and winch system and a fluid-filled borehole after 
the firn-ice transition. Annual layers determined by chemistry measurements 
record a mean annual snow accumulation at this site of 0.63-m water equivalent. 
Borehole temperature measurements indicate a mean annual site temperature of 
−14.4 °C, in agreement with earlier studies at this site. The basal temperature of 
the ice sheet measured in the borehole was −8.5 °C, which is consistent with a 
normal geothermal heat flux of around 50 mW m⁻² at this location. 

Age scale. The Holocene age profile for the JRI ice core is identified as the JRI-1 age 
scale (Fig. 2). It is based on a glaciological flow model that accounts for firn 
compaction and characterizes the expected vertical and horizontal ice flow caused 
by internal deformation, plug flow and Raymond-Reeh flow. Application of the 
glaciological model uses the assumption that flow at the site has not changed 
through time, which is expected to be a reasonable first-order assumption over 
the Holocene interval that we focus on here. The glaciological flow model was run 
using the mean annual site temperature, snow accumulation, ice-sheet thickness 
and geothermal heat flux (see above) as input parameters. A number of fixed time 
markers were then used to make adjustments to the modelled depth-age profile. 
These fixed time markers include the local Deception Island eruption tephra in 
December 1967 (ref. 12), the global-scale sulphate anomaly caused by the 1815 
eruption of Mount Tambora, the AD 1259 volcanic sequence seen in dielectric 
profiling of this core, and matching of 14 tephra layers in the JRI ice core to widely 
documented tephra horizons in marine and lake sediment cores from the 
Antarctic Peninsula region. The isotopic anomaly of the Antarctic cold reversal 
was also used to connect the lower portion of the modelled JRI chronology to the 
EDC3 age scale. Age control on the tephra horizons used for refining the 
chronology is derived from radiocarbon dating, and the estimated age uncertainty 
in the early Holocene is ±500 yr, that in the mid Holocene is ±200 yr and that in 
the late Holocene is ±100 yr. For the AD 1259 and Tambora eruption events, the 
estimated age uncertainties are ±5 yr and ±1 yr, respectively. Full details of the 
time markers used to establish the Holocene JRI-1 age scale and their estimated 
uncertainties are provided in Supplementary Table 1. 

Analytical details. Deuterium isotope (δD) measurements were made along the 
whole length of the ice core at the NERC Isotope Geosciences Laboratory using an 
online chromium reduction method with a EuroPyrOH-3110 system coupled to a 
Micromass Isoprime mass spectrometer. Analytical precision is typically 1.0‰ for 
δD. Measurements were made at 11-cm resolution from the surface to a snow 
depth of 300 m, at 5-cm resolution from 300 to 350 m, and at approximately 
1.5-cm resolution from 350 m to bedrock. Duplicate measurements of δD were 
also made at the British Antarctic Survey using a Los Gatos Research DLT-100 
cavity ring-down laser spectroscopy instrument with a precision of typically 1.0‰ 
for δD. Across 770 duplicates, the mean difference in δD results obtained by the 
mass spectrometry and laser spectroscopy methods is 1.02‰. A total of 5,116 
discrete δD results were used for the temperature reconstruction. Oxygen isotope 
(δ¹⁸O) measurements were made at the NERC Isotope Geosciences Laboratory, 
using the CO₂ equilibration method with a VG Isoprep 18 device and a VG SIRA 
10 mass spectrometer. The δ¹⁸O measurements have a typical precision of 0.08‰ 
and the data presented in Supplementary Fig. 1 comprise 4,592 analyses. The 
relationship between δ¹⁸O and δD in the JRI ice-core data has a slope of 8.02, 
which is consistent with the meteoric water line. Isotope measurements used 
internal standards calibrated against the international standards Vienna 
Standard Mean Ocean Water (VSMOW2) and Vienna Standard Light Antarctic 
Precipitation (VSLAP2). 
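
The slope comparison described above can be illustrated with a minimal sketch. It assumes paired δ¹⁸O and δD values in per mil and uses an ordinary least-squares fit, which may differ from the regression actually used for the JRI data. The deuterium excess is computed with the standard definition d = δD − 8·δ¹⁸O. The function names and the sample values are hypothetical and are not taken from the ice-core record.

```python
# Sketch only: ordinary least-squares slope of dD against d18O, plus the
# standard deuterium excess d = dD - 8 * d18O. All values are in per mil.
# Function names and sample data are hypothetical illustrations.

def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def deuterium_excess(delta_d, delta_18o):
    """Deuterium excess for each paired sample."""
    return [d - 8.0 * o for d, o in zip(delta_d, delta_18o)]

# Synthetic samples lying exactly on a line with slope 8 and intercept 10.
sample_d18o = [-20.0, -19.0, -18.5, -17.0]
sample_dd = [8.0 * o + 10.0 for o in sample_d18o]

print(least_squares_slope(sample_d18o, sample_dd))
print(deuterium_excess(sample_dd, sample_d18o))
```

A fitted slope close to 8, as reported for the JRI data, is what the global meteoric water line would predict.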

Temperature reconstruction. A comparison with recent temperature records has 
shown that at this site δD has a temperature dependence of 6.4 ± 1.3‰ °C⁻¹ 
(ref. 12), consistent with the modern-day spatial δD-temperature relationship across 
Antarctica. For δ¹⁸O, a temperature dependence of 0.80 ± 0.14‰ °C⁻¹ was 
used. It has been shown that snowfall at this site occurs year round and does 
not seem to bias the isotopic record towards any specific season. Comparison of 
δD- and δ¹⁸O-based temperature reconstructions, and calculation of the deuterium 
excess, also indicates that changes in source temperature have been negligible for this 
site and that the isotope history primarily reflects changes in temperature at the JRI 
site (Supplementary Fig. 1). The temperature reconstruction was calculated using the 
assumption that the modern δD-temperature calibration holds over the entire 
record and that any changes in the seasonality of snowfall have a negligible effect 
on the mean isotopic changes. This is believed to be a reasonable assumption for 
Antarctic ice cores extending through the Holocene and into the LGM, but may 
be less robust for climates significantly warmer than the present. The temperature 
reconstruction also takes into account changes in the isotopic composition of the 
ocean using the method of ref. 10 and ocean isotope values calculated in ref. 11. 
Temperature anomalies were calculated with reference to the AD 1961-1990 interval 
of the JRI ice core, and mean temperature anomalies are reported with uncertainties 
that combine the standard error of the calibration dependence and the standard 
deviation of the 100-yr-binned data within each interval. For temperature trends, the 
uncertainty estimates denote the standard error of the trend determination. 
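
As a rough illustration of the calibration step, the sketch below converts δD values to temperature anomalies using the 6.4‰ °C⁻¹ dependence quoted above, relative to the mean of a reference interval. It is a simplification under stated assumptions: it omits the ocean-isotope correction (refs 10, 11) and the uncertainty propagation, and the function name and all sample values are hypothetical.

```python
# Sketch only: dD (per mil) to temperature anomaly (deg C) with a fixed
# linear calibration. The slope is the value quoted in the text; the ocean
# isotope correction and the uncertainty estimates are deliberately omitted.
from statistics import mean

SLOPE_PERMIL_PER_DEGC = 6.4  # dD temperature dependence from the text

def temperature_anomalies(delta_d, reference_delta_d):
    """Anomalies relative to the mean dD of a reference interval."""
    reference_mean = mean(reference_delta_d)
    return [(d - reference_mean) / SLOPE_PERMIL_PER_DEGC for d in delta_d]

# Hypothetical values: a reference interval and two older samples.
reference = [-150.0, -151.0, -149.0]
samples = [-150.0, -156.4]
print(temperature_anomalies(samples, reference))
```

With this calibration, a sample 6.4‰ lighter than the reference mean corresponds to an anomaly of about −1 °C.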
Dopamine neurons modulate pheromone responses 
in Drosophila courtship learning 


Krystyna Keleman¹, Eleftheria Vrontou¹†, Sebastian Krüttner¹, Jai Y. Yu¹†, Amina Kurtovic-Kozaric¹† & Barry J. Dickson¹ 


Learning through trial-and-error interactions allows animals to 
adapt innate behavioural ‘rules of thumb’ to the local environment, 
improving their prospects for survival and reproduction. Naive 
Drosophila melanogaster males, for example, court both virgin 
and mated females, but learn through experience to selectively 
suppress futile courtship towards females that have already mated. 
Here we show that courtship learning reflects an enhanced response 
to the male pheromone cis-vaccenyl acetate (cVA), which 
is deposited on females during mating and thus distinguishes 
mated females from virgins. Dissociation experiments suggest a 
simple learning rule in which unsuccessful courtship enhances 
sensitivity to cVA. The learning experience can be mimicked by 
artificial activation of dopaminergic neurons, and we identify a 
specific class of dopaminergic neuron that is critical for courtship 
learning. These neurons provide input to the mushroom body 
(MB) γ lobe, and the DopR1 dopamine receptor is required in 
MBγ neurons for both natural and artificial courtship learning. 
Our work thus reveals critical behavioural, cellular and molecular 
components of the learning rule by which Drosophila adjusts its 
innate mating strategy according to experience. 

Mature virgin Drosophila females are usually willing to mate, 
whereas those that have recently mated are generally recalcitrant to 
further mating attempts. A male thus increases his overall mating 
success if he concentrates his courtship efforts on virgins. Given geo- 
graphic and seasonal fluctuations in the relative abundance of virgins 
and mated females, and the cues that distinguish them, the optimal 
courtship strategy is unlikely to be a species universal. A heuristic for 
approaching this optimum could, however, be universal, allowing 
evolution to select for genes that implement such a learning rule in 
the fly’s brain. 

A male’s courtship behaviour can be quantified by a courtship index 
(CI), and his ability to discriminate virgins from mated females by a 
discrimination index (DI), the relative reduction in the mean CI in 
single-pair assays with mated versus virgin females: 
$\mathrm{DI} = [\mathrm{CI}_{v} - \mathrm{CI}_{m}]/\mathrm{CI}_{v}$. 
In our assays, naive males courted mated females only marginally 
less vigorously than they courted virgins (DI = 13.8%; Fig. 1a, b and 
Supplementary Table 1a), whereas males that had experienced rejection 
from mated females were subsequently much less active when 
courting mated females than virgins (DI = 51.6%; Fig. 1a, b and 
Supplementary Table 1a). The relative difference between the mean 
CIs of experienced ($\mathrm{CI}^{+}$) and naive ($\mathrm{CI}^{-}$) males gives rise to a learning 
index: $\mathrm{LI} = [\mathrm{CI}^{-} - \mathrm{CI}^{+}]/\mathrm{CI}^{-}$. For males trained with mated females, 
the LI was just 7.8% in tests with virgin females but 48.2% when tested 
with mated females (Fig. 1c, d and Supplementary Table 1b). Similar 
results were obtained when decapitated virgins were used as trainers 
(Fig. 1e, f and Supplementary Table 2), suggesting that male behaviour 
is conditioned by the failure to mate, not by active rejection from the 
female. 
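
The two indices defined above can be sketched in a few lines. This is a minimal illustration, assuming that each CI is a percentage scored for one male and that the indices are formed from group means and reported as percentages; the function names and sample values are hypothetical.

```python
# Sketch only: discrimination index (DI) and learning index (LI) computed
# from per-male courtship indices (CIs, in percent). Names and sample data
# are hypothetical; the indices follow the definitions in the text.
from statistics import mean

def discrimination_index(ci_virgin, ci_mated):
    """Relative reduction (%) in mean CI with mated versus virgin females."""
    mean_virgin = mean(ci_virgin)
    return 100.0 * (mean_virgin - mean(ci_mated)) / mean_virgin

def learning_index(ci_naive, ci_experienced):
    """Relative reduction (%) in mean CI of experienced versus naive males."""
    mean_naive = mean(ci_naive)
    return 100.0 * (mean_naive - mean(ci_experienced)) / mean_naive

# Hypothetical CIs (percentage of a 10-min period spent courting).
naive_with_mated = [80.0, 90.0, 70.0]
experienced_with_mated = [40.0, 45.0, 35.0]
print(learning_index(naive_with_mated, experienced_with_mated))
```

Both indices are zero when the two groups court equally and approach 100% when courtship is almost fully suppressed in the second group.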

To discriminate mated females from virgins, a male might detect 
either the subtle changes in female pheromones on mating or the 
telltale vestiges of male pheromones that linger on mated females. 
The male-specific pheromone cVA is transferred to the female cuticle 
on mating. It is not detectable on the cuticle of either males or virgin 
females. Naive Or67d mutant males, which are unable to detect cVA, 
courted virgin and mated females equally (DI = −0.4%) and did not 
benefit from training (LI = −3.0%; Fig. 1g, h and Supplementary 
Tables 3 and 4). In contrast, analogous mutations in either of two other 
candidate pheromone receptor genes, Or47b and Gr68a, did not 
impair discrimination or learning (Fig. 1g, h and Supplementary 
Tables 3 and 4). cVA detection is therefore crucial for naive and experienced 
males to discriminate mated females from virgins. 

The salient feature of training might be the presence of cVA on the 
mated female, the lack of courtship success, or an association formed 
between the two. We designed a dissociation experiment to distinguish 
between these possibilities. Female post-mating behaviour, including 
courtship rejection, is triggered by sex peptide (SP), a male seminal 
fluid peptide transferred to the female during mating. Virgin females 
in which SP is transgenically expressed in the nervous system reject 
courting males (pseudomated females), whereas females that have 
mated with SP-null mutant males are still receptive (pseudovirgins). 
As expected, we detected cVA on the cuticle of both mated females and 
pseudovirgins (178.8 ± 11.0 and 57.5 ± 14.7 ng per fly (means ± s.e.m.), 
respectively; n = 3), but not on virgins or pseudomated females (n = 3). 
Thus, with pseudomated and pseudovirgin females the presence of cVA 
and sexual receptivity are fully dissociated. 

Pseudomated females were just as effective as genuinely mated 
females when used as trainers (Fig. 1i, j and Supplementary Table 5), 
whereas pseudovirgin females were not (Fig. 1k, l and Supplementary 
Table 6). In contrast, pseudovirgin but not pseudomated females were 
as effective as mated females when used as testers (Fig. 1i-l and 
Supplementary Tables 5 and 6). Indeed, robust courtship learning 
was observed when males were trained with pseudomated females 
and tested with pseudovirgins, but not vice versa (Fig. 1m, n and 
Supplementary Table 7). We therefore conclude that the salient feature 
of training is simply the lack of courtship success, not its association 
with cVA, and that training alters the male’s response to cVA or some 
other vestige of previous contact with another male. 

To test whether training does indeed alter sensitivity to cVA, we 
applied varying doses of cVA to pseudomated females and presented 
them as testers to naive and experienced males. As expected, high 
doses of cVA inhibited courtship by both naive and experienced males 
(Fig. 1o and Supplementary Table 8). However, males trained with 
either mated or pseudomated females were inhibited by much lower 
doses of cVA than naive males were (Fig. 1o and Supplementary Table 8). 
Courtship training did not enhance sensitivity to an unrelated aversive 
odorant (Supplementary Fig. 1). 

Dopamine is thought to provide a learning signal in a variety of 
different models and species, including aversive olfactory learning 
and conditioned suppression of male-male courtship in Drosophila. If 
dopamine also encodes an instructive signal during courtship learning, 
then artificial stimulation of dopaminergic neurons might mimic 
training with a mated female. To test this, we expressed the warmth-activated 
TrpA1 channel in most dopaminergic neurons, and 
attempted to ‘train’ naive isolated males by warming them briefly to 
30 °C. When subsequently returned to 25 °C and tested with mated 
females, the courtship activity of these males was indeed markedly 
reduced in comparison with that of control males (Fig. 2a, b and 
Supplementary Table 9). This suppression was specific for courtship 
towards mated but not virgin females, was dependent on a functional 
Or67d receptor (Fig. 2a, b and Supplementary Table 9), and was correlated 
with an increased sensitivity to cVA (Fig. 2c and Supplementary 
Table 10). In these respects, activation of dopaminergic neurons thus 
mimics a specific courtship learning signal rather than a non-specific 
punishment signal that might be expected to suppress courtship more 
generally. Experiments in which we selectively activated various 
subsets of dopaminergic neurons further suggest that the neurons 
involved in courtship learning are distinct from those previously 
implicated in various forms of aversive olfactory learning (Fig. 2d, e 
and Supplementary Table 11). 

Figure 1 | Experience enhances the behavioural response to cVA. 
a-f, Courtship (a, c, e), discrimination (b) and learning indices (d, f) of wild-type 
males. Trainer and tester females: -, none (naive males); m, mated female; 
v, virgin; dv, decapitated virgin. Box-and-whisker plots for CI show 10th, 25th, 
50th, 75th and 90th centiles and mean (+). Three asterisks, P < 0.001 
compared with naive male (b) or virgin tester (d); n.s., P > 0.05 compared with 
decapitated virgin trainers (f). g, h, Courtship index (g) and discrimination and 
learning indices (h) of Or67d, Or47b and Gr68a mutant males. n.s., P > 0.05; 
three asterisks, P < 0.001 compared with wild-type controls. i-n, Courtship 
(i, k, m) and learning (j, l, n) indices of wild-type males in dissociation 
experiments using pseudomated females (ψm, elav-GAL4 UAS-SP) and 
pseudovirgin females (ψv, wild-type females previously mated to SP-null 
mutant males). n.s., P > 0.05; three asterisks, P < 0.001 compared with assays 
with mated females as trainers and testers (j, l), or the reciprocal assay (n). 
Post-mating behaviours are not completely eliminated in pseudovirgin females, 
because SP function can be partly compensated for by the related DUP99B 
peptide. o, Courtship indices of naive and experienced males towards 
pseudomated females perfumed with varying doses of cVA. P < 0.01 for all 
comparisons of experienced to naive males; P > 0.05 for all comparisons 
between males trained with mated versus pseudomated females. 

Many aspects of male courtship behaviour have been linked to the 
set of neurons that express the fruitless (fru) gene. Among these are 
the Or67d olfactory neurons (OSNs) and MBγ neurons, both of which 
function in courtship learning (Fig. 1e, f and ref. 22). We speculated 
that the dopaminergic neurons involved in courtship learning might 
also be fru⁺. To test this hypothesis we acutely blocked synaptic transmission 
of fru⁺ dopaminergic neurons by using shiᵗˢ (refs 23-25), 
which inhibits synaptic vesicle recycling at 30 °C but not at 22 °C. 
Such males showed significantly impaired learning when trained at 
30 °C and tested at 22 °C, but not vice versa (Fig. 2f, g and 
Supplementary Table 12). These data thus establish a requirement 
for dopaminergic neurons in memory formation, not recall, and 
further indicate that the relevant cells are fru⁺. 

We previously identified two distinct classes of fru⁺ dopaminergic 
neurons: aSP4 and aSP13 (ref. 25) (Fig. 3a-g and Supplementary Fig. 2). 
Figure 2 | Activation of dopaminergic neurons is necessary and sufficient 
for learning. a, b, Courtship (a) and ‘fictive learning’ (b) indices of males of the 
indicated genotypes. Before testing, isolated males were either retained at the 
normal culture temperature of 22 °C (−) or warmed to 30 °C for 45 min (+). 
Three asterisks, P < 0.001 compared with TH-GAL4/+ Or67d⁺ males. 
c, Courtship indices of naive and fictively trained males towards pseudomated 
females perfumed with various doses of cVA. P < 0.01 for all comparisons at a 
given cVA dose. d, e, Courtship (d) and ‘fictive learning’ (e) indices of males of 
the indicated genotypes. Three asterisks, P < 0.001; n.s., P > 0.05 compared 
with corresponding control without UAS-trpA1. HL9-GAL4 includes most 
dopaminergic neurons, but not PPL1 cluster neurons implicated in olfactory 
learning in a heat punishment assay. The other GAL4 lines drive expression in 
MB-M3 or MB-MP1 neurons, implicated in olfactory learning in an electric 
shock model. f, g, Courtship (f) and learning (g) indices of fruFLP TH-GAL4 
UAS>stop>shiᵗˢ males. Training and testing were performed at the 
indicated temperatures with mated females. Box-and-whisker plots for CI show 
10th, 25th, 50th, 75th and 90th centiles and mean (+). Two asterisks, P < 0.01; 
n.s., P > 0.05 compared with males trained and tested at 22 °C. 


Figure 3 | Courtship learning requires synaptic 
transmission of aSP13 neurons. a, b, Surface 
representation of aSP4 (a) and aSP13 (b) neurons 
in a male brain. There is typically one aSP4 
neuron and two to four aSP13 neurons per 
hemisphere. c, d, Overlay of registered and masked 
confocal images of aSP4 (c) and aSP13 
(d) neurons, labelled with the presynaptic marker 
green fluorescent protein (GFP)-tagged nsyb 
(magenta) and the dendritic marker Dscam17.1-GFP 
(green). Yellow arrowheads in d indicate the 
presynaptic innervation of aSP13 at the tip of the 
MB γ lobe. e, Brain of a TH-GAL4 fruFLP 
UAS>stop>mCD8-GFP male stained with anti-GFP 
(green) and the general synaptic marker 
monoclonal antibody nc82 (magenta). 
f, g, Enlarged and inverted views of the green 
channel of e. Arrowheads indicate aSP4 (red, f) and 
aSP13 (green, g) soma. h, i, Brain of fruFLP 
UAS>stop>mCD8-GFP males carrying either 
Or67dGAL4 (h) or 201Y-GAL4 (i), stained with anti-GFP 
(green) and monoclonal antibody nc82 
(magenta). j-m, Brains of fruFLP 
UAS>stop>mCD8-GFP males, additionally 
carrying the indicated GAL4 driver, stained with 
anti-GFP (black). n, Diagram of cVA processing 
pathway, adapted from ref. 25. PN, olfactory 
To test whether aSP4 and/or aSP13 neurons contribute to courtship 
learning, we chronically inhibited synaptic transmission in these neurons 
with tetanus toxin light chain (TNT), using drivers selective for 
either aSP4 or aSP13 (refs 24, 25). With each of five independent aSP13 
drivers, learning was reduced by about 50% compared with control 
males that carried an inactive version of the TNT transgene in the same 
genetic background (Fig. 3j-l, o, p and Supplementary Table 13). A 
similar learning deficit was observed in positive controls in which we 
targeted TNT to both aSP13 and aSP4, to Or67d⁺ OSNs, or to MBγ 
neurons (Fig. 3h, i, o, p and Supplementary Table 13). In contrast, 
courtship learning was unimpaired in assays using either of two driver 
lines expressed in aSP4 but not aSP13 (Fig. 3m, o, p and Supplementary 
Table 13). We conclude that synaptic transmission of aSP13 neurons is 
crucial for courtship learning. 

The presynaptic termini of aSP13 neurons are located at the tip of 
the MB γ lobe (Fig. 3d), indicating that they might convey a dopamine 
learning signal to MBγ neurons. If so, then a dopamine receptor 
should be required specifically in MBγ neurons for courtship learning. 
We considered the DopR1 and DopR2 receptors as candidates, and 
used homologous recombination to generate analogous loss-of-function 
alleles for each gene (DopR1attP and DopR2attP, respectively). 
Both mutants are viable and fertile and homozygous naive males court 
at normal levels (Fig. 4a and Supplementary Table 14). However, 
courtship learning was significantly impaired in DopR1attP but not 
DopR2attP mutants (Fig. 4b and Supplementary Table 14), as was 
‘fictive learning’ induced by thermogenetic activation of dopaminergic 
neurons (Fig. 4c, d and Supplementary Table 15). Nevertheless, learning 
was not completely eliminated in these DopR1 mutants, indicating 
that other dopamine receptors might also contribute. To confirm that 
the learning deficit in the DopR1attP mutant was indeed due to loss of 
DopR1 function, we reintegrated the deleted genomic region by site-specific 
transgenesis. Males homozygous for this repaired DopR1 
allele, DopR1**, performed just as well as wild-type males in courtship 
learning assays (Fig. 4a, b and Supplementary Table 14). 

Figure 4 | DopR1 functions in MBγ neurons. a, b, Courtship (a) and learning 
(b) indices of DopR mutants. n.s., P > 0.05; two asterisks, P < 0.01 compared 
with wild-type (WT) males. c, d, Courtship (c) and ‘fictive learning’ (d) indices 
in fictive learning assays with mated female testers. Before testing, isolated 
males were either retained at the normal culture temperature of 22 °C (−) or 
warmed to 30 °C for 45 min (+). For male genotypes, + and − indicate the 
presence or absence, respectively, of the TH-GAL4 and UAS-trpA1 transgenes; 
for DopR1, ‘+’ indicates the wild-type control allele and ‘attP’ the DopR1attP 
mutant. Three asterisks, P < 0.0001 compared with wild-type males. 
e, f, Courtship (e) and learning (f) indices on RNAi knockdown of DopR1. For 
male genotypes, + and − indicate the presence or absence of the UAS-DopR1-IR 
RNAi transgene. n.s., P > 0.05; two asterisks, P < 0.01; three asterisks, 
P < 0.0001 compared with control males without the UAS-DopR1-IR 
transgene. g, h, Courtship (g) and learning (h) indices on rescue of DopR1 
function. All males are DopR1attP mutants (attP) carrying a UAS-DopR1 
transgene (+) and either no (−) or the indicated GAL4 driver. Box-and-whisker 
plots for CI show 10th, 25th, 50th, 75th and 90th centiles and mean 
(+). n.s., P > 0.05; three asterisks, P < 0.0001 compared with control males 
without a GAL4 driver. 

Finally, we performed RNA-mediated interference (RNAi) knock- 
down and rescue experiments to test whether DopR1 function is 
indeed required in MBγ neurons. Expression of a DopR1 RNAi transgene 
selectively in MBγ neurons significantly reduced DopR1 expression 
levels in the γ lobe (Supplementary Fig. 3) and impaired courtship 
learning (Fig. 4e, f and Supplementary Table 16). Conversely, the 
learning disability of DopR1attP mutants was fully alleviated by expressing 
a DopR1 transgene specifically in MBγ neurons (Fig. 4g, h and 
Supplementary Table 17). We therefore postulate that DopR1 acts in 
MBγ neurons to transduce a dopamine learning signal provided by 
aSP13 neurons. 

To maximize his reproductive success, a Drosophila male should 
be highly attuned to those cues that discriminate receptive from 
unreceptive females. A male that is too selective may miss mating 
opportunities; a male that is too promiscuous may waste resources 
on futile courtship. The optimal tuning is likely to vary from place 
to place and from time to time, depending for example on local and 
seasonal fluctuations in the abundance and quality of mating partners 
and the pheromone signals that they provide. Our study defines a 
simple heuristic that could allow the male to learn an effective court- 
ship strategy in his local environment: be promiscuous at first, but 
become more selective if a mating attempt fails. Furthermore, we have 
identified key elements that implement this learning rule in the fly’s 
brain. We propose that, when a mating attempt fails, aSP13 dopaminergic 
neurons convey a learning signal to MBγ neurons through the 
DopR1 receptor, and that this induces lasting changes in the internal 
processing of the cVA signal that discriminates mated females from 
virgins. Further studies of this genetically defined and tractable circuit 
should provide a detailed understanding of how a relatively simple 
learning circuit, embedded within decision-making centres of the 
brain, endows plasticity on an innate behaviour. 


METHODS SUMMARY 


Courtship conditioning assays and data analyses were performed as described 
previously. CIs, defined as the percentage of time for which the male courts 
the female during a 10-min observation period, were scored manually from video 
recordings. Mann-Whitney-Wilcoxon tests were used for statistical comparisons 
of CIs between two data sets. Permutation tests were used to compare DIs and LIs, 
with 100,000 permutations of the raw data. For ‘fictive training’, males were 
collected at eclosion and aged in isolation for 5-7 days at 22 °C, transferred by 
gentle aspiration to prewarmed chambers at 30 °C for 45 min, then to 25 °C for 
10-25 min before testing. Perfuming experiments with cVA were performed by 
applying 1 µl of appropriate dilution to the female’s abdomen about 45 min before use as 
a tester. Immunostaining, confocal microscopy, image registration, and visualization 
were performed as described. Or47bGAL4 and Gr68aGAL4 alleles were generated by 
ends-in homologous recombination, and DopR1attP and DopR2attP by ends-out 
targeting. For the Or47b and Gr68a mutants the GAL4-coding region replaces 
the entire endogenous coding region. For the DopR1 and DopR2 mutants the attP 
site replaces the respective first coding exons. The UAS-DopR1-IR line is from the KK 
library maintained at the Vienna Drosophila RNAi Center (VDRC; 
http://www.vdrc.at). 
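
The text states only that permutation tests with 100,000 permutations of the raw data were used to compare LIs and DIs; it does not give the test statistic or whether the test was one- or two-sided. The sketch below is therefore one plausible reading, not the authors' procedure: it tests a single LI against the null hypothesis of no difference between naive and experienced males by shuffling group labels, and it treats the test as two-sided. All names and sample values are hypothetical.

```python
# Sketch only: label-shuffling permutation test for a learning index (LI).
# The two-sided criterion and the choice of statistic are assumptions.
import random
from statistics import mean

def learning_index(ci_naive, ci_experienced):
    """Relative reduction (%) in mean CI of experienced versus naive males."""
    mean_naive = mean(ci_naive)
    return 100.0 * (mean_naive - mean(ci_experienced)) / mean_naive

def permutation_p_value(ci_naive, ci_experienced,
                        n_permutations=100_000, seed=0):
    """Fraction of label shuffles whose |LI| is at least the observed |LI|."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(learning_index(ci_naive, ci_experienced))
    pooled = list(ci_naive) + list(ci_experienced)
    n_naive = len(ci_naive)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = learning_index(pooled[:n_naive], pooled[n_naive:])
        if abs(permuted) >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical CIs (percent); fewer permutations keep the example quick.
naive = [80.0, 85.0, 90.0, 75.0, 88.0, 82.0]
experienced = [30.0, 35.0, 40.0, 25.0, 38.0, 32.0]
print(permutation_p_value(naive, experienced, n_permutations=2000))
```

Comparing two LIs with each other, as the figure legends imply, would need a different statistic, for example the difference between the LIs with labels shuffled within each tester group.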


Full Methods and any associated references are available in the online version of 
the paper. 
METHODS 


Fly strains. Or47b°4™ was generated by ends-in homologous recombination, 
following a strategy analogous to that previously used to generate Or67d@*"“ 
(ref. 6), using homology arms of 3.6kilobases (kb) and 2.7kb flanking the 
Or47b open reading frame. The initial duplication was resolved by I-Crel- 
mediated excision of the intervening white’ marker, yielding independent lines 
in which the Or47b open reading frame was either precisely replaced with the 
GAL4 open reading frame (Or47b“") or restored to wild type (Or47b*). The 
targeted alleles were verified by genomic polymerase chain reaction (PCR) and 
DNA sequencing across the entire homology region. Or47b°*! was also con- 
firmed to drive UAS transgene expression specifically in Or47b OSNs that target 
the VA1v glomerulus. Both alleles were crossed for four or five generations into a 
Canton S background before being used in behavioural assays. 

Gr68a^GAL4 and its corresponding wild-type control, Gr68a^+, were generated
and verified in a similar manner, using homology arms of 4.5 kb and 3.1 kb 
flanking the Gr68a open reading frame. 

DopR1^attP was generated by ends-out homologous recombination, using homology
arms of 4.1 kb and 4.0 kb flanking the first coding exon of DopR1 (CG9652).
This exon encodes the first 111 amino acids of DopR1. In the initial recombinant,
this region was replaced with an attP site followed by a white^+ marker flanked by
mFRT11 recognition sites for the mFLP5 recombinase*’. Removal of the white^+
marker using hs-mFLP5 generated the final DopR1^attP allele, in which the first exon
is replaced by an attP site and a single mFRT11 site. This structure was confirmed
by genomic PCR and DNA sequencing across the entire homology region.
DopR1^attP was then crossed for four or five generations into a Canton S
background before being used in behavioural assays.

DopR2^attP was generated, verified and cantonized in a similar manner to that for
DopR1^attP. Initial targeting used homology arms of 4.0 kb and 4.0 kb flanking the
first coding exon of DopR2 (CG18741). This exon encodes the first 482 amino acids
of DopR2, and is replaced in the DopR2^attP allele by an attP site and an mFRT11
site.

Other stocks: Additional stocks used in this study were Or67d^GAL4, Or67d^+
and UAS-Or67d (ref. 6), fru''” and the GAL4 lines 3-8, 8-194, 10-16, NP368, 
NP3591 and NP7036 (ref. 25), TH-GAL4 (ref. 18), 201Y (ref. 26), 1471 (ref. 27), 
c305a and c739 (ref. 31), UAS-trpA1 (ref. 17), UAS>stop>TNT and
UAS>stop>TNT2 (ref. 24), UAS-DopR1-IR (VDRC stock number 107058;
http://www.vdrc.at), UAS-Dcr-2 (ref. 32), SP^0 (ref. 13), elav-GAL4 (ref. 33) and
UAS-SP (ref. 12). UAS-DopR1 was generated by PCR amplification of the DopR1 
open reading frame from fly head cDNA with the primers 5’-CGCGGTA 
CCAAAATGACAAATGCAATGCGGGCGATTGCTGCAATC-3’ and 5'-CGC 
TCTAGAATCAAATCGCAGACACCTGCTCCAGTTCGG-3’, and cloning the 
product as an Asp718-XbaI fragment into a pUAST-derivative (pKC27) for
φC31-mediated transgenic insertion into the VIE-260 attP site on
chromosome II (K.K. and B.J.D., unpublished observations). 
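
As a quick check on the cloning strategy described above, the following Python sketch locates the Asp718 and XbaI recognition sequences (GGTACC and TCTAGA, the standard sites for these enzymes) in the two primers. The function name is ours, and the sketch assumes the primer sequences were transcribed without error; it is an illustration, not part of the original protocol.

```python
# Primer sequences as printed in the text (5' to 3').
FORWARD = "CGCGGTACCAAAATGACAAATGCAATGCGGGCGATTGCTGCAATC"
REVERSE = "CGCTCTAGAATCAAATCGCAGACACCTGCTCCAGTTCGG"

# Standard recognition sequences; Asp718 recognizes the same site as KpnI.
SITES = {"Asp718": "GGTACC", "XbaI": "TCTAGA"}


def find_site(primer: str, site: str) -> int:
    """Return the 0-based index of the first occurrence of site, or -1 if absent."""
    return primer.upper().find(site.upper())


if __name__ == "__main__":
    print("Asp718 site in forward primer at index",
          find_site(FORWARD, SITES["Asp718"]))
    print("XbaI site in reverse primer at index",
          find_site(REVERSE, SITES["XbaI"]))
```

In both primers the site follows a three-nucleotide 5' clamp (CGC), which is consistent with cloning the product as an Asp718-XbaI fragment.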

Courtship conditioning assays. Assays for short-term courtship conditioning 
were performed by testing males 10-15 min after training as described previously’.
Pseudomated females were elav-GAL4/+ UAS-SP/+ virgins.
Pseudovirgin females were Canton S females that had been housed in groups of
10-12 together with 10-12 SP^0 homozygous males for 24 h. The males were then
removed and females used within 1 h. cVA perfuming was performed by applying
1 µl of various dilutions of cVA (Pherobank) in acetone to the abdomen of
pseudomated females under light CO2 anaesthesia. Perfumed females were transferred
to food vials to recover for about 45 min before use. For ‘fictive learning’
experiments with UAS-trpA1, flies were raised at 22 °C, and males were collected at 
eclosion and aged in isolation for a further 5-7 days at 22 °C before being transferred
to chambers prewarmed to 30 °C for 45 min. Males were then transferred
back to courtship chambers at 22 °C and tested within 10-15 min. For transient
inactivation experiments with UAS-shi^ts, flies were raised at 22 °C and, if appropriate,
shifted to 30 °C for the entire training period or immediately after training
and during the test. All tests were videotaped and manually scored for CI. 
Wherever possible, all genotypes and conditions for each experiment were assayed 
within a single session on each of several days. Where the number of assays per 
experiment precluded running them all within a single session, at least the controls 
were included in each replicate. In the rare cases in which data for the controls 
differed significantly between sessions, the entire data set for that session was 
excluded; otherwise data were then pooled across sessions. 

Statistical comparisons of CIs used the Mann-Whitney test, and DI and LI were 
compared using the permutation test with 100,000 random permutations”. By 
convention, DIs and LIs were calculated using the mean CIs. However, because CIs
are generally not normally distributed, DIs, LIs and P values were also calculated 
separately using median CIs. Figures show LIs and P values calculated from mean 
CIs; Supplementary Tables show values derived from both mean and median CIs. 
Where appropriate, the false discovery rate for multiple hypothesis testing was 
assessed using the Benjamini-Hochberg procedure*’ with α = 0.05. Figures show
uncorrected P values; Supplementary Tables indicate whether the data support the 
null hypothesis after this correction. Statistical significance was generally consist- 
ent whether mean or median CIs were used and unaltered by the correction for 
multiple hypothesis testing (see Supplementary Tables). 
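
The statistical steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: it assumes the learning index is the relative reduction in mean CI of trained versus naive males, LI = (CI_naive − CI_trained) / CI_naive, and it tests the one-sided null hypothesis LI ≤ 0 by shuffling the naive/trained labels. All function names are hypothetical.

```python
import random


def learning_index(naive, trained):
    """Relative reduction in mean CI (assumed definition of LI)."""
    mean_naive = sum(naive) / len(naive)
    if mean_naive == 0:
        return 0.0  # guard against division by zero in degenerate shuffles
    mean_trained = sum(trained) / len(trained)
    return (mean_naive - mean_trained) / mean_naive


def permutation_p_value(naive, trained, n_perm=100_000, seed=0):
    """One-sided permutation P value for LI > 0, shuffling group labels."""
    rng = random.Random(seed)
    observed = learning_index(naive, trained)
    pooled = list(naive) + list(trained)
    k = len(naive)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if learning_index(pooled[:k], pooled[k:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank whose P value passes its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff
    return rejected
```

Passing a list of per-comparison P values to `benjamini_hochberg` with `alpha=0.05` mirrors the correction described in the text; replacing the means in `learning_index` with medians would give the median-based variant.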
Immunohistochemistry. Immunohistochemistry, confocal microscopy, image 
registration and visualization were performed as described previously”. 
cVA measurements. For cVA measurements, flies were prepared as for behavioural
experiments and individually soaked in 30 µl of hexane for 5 min with
agitation. n-Hexacosane and n-triacontane (100 ng of each) were added as internal
standards. The fly was then removed and 1 µl of the hexane extract was analysed by
gas chromatography and mass spectrometry with a Shimadzu QP2010 apparatus’.

Adult neurogenesis arises from neural stem cells within specialized
niches’*. Neuronal activity and experience, presumably acting on 
this local niche, regulate multiple stages of adult neurogenesis, 
from neural progenitor proliferation to new neuron maturation, 
synaptic integration and survival’’. It is unknown whether local 
neuronal circuitry has a direct impact on adult neural stem cells. 
Here we show that, in the adult mouse hippocampus, nestin- 
expressing radial glia-like quiescent neural stem cells*”? (RGLs) 
respond tonically to the neurotransmitter γ-aminobutyric acid
(GABA) by means of γ2-subunit-containing GABA_A receptors.
Clonal analysis’ of individual RGLs revealed a rapid exit from 
quiescence and enhanced symmetrical self-renewal after conditional
deletion of γ2. RGLs are in close proximity to terminals
expressing 67-kDa glutamic acid decarboxylase (GAD67) of parvalbumin-expressing
(PV+) interneurons and respond tonically to
GABA released from these neurons. Functionally, optogenetic
control of the activity of dentate PV+ interneurons, but not that
of somatostatin-expressing or vasoactive intestinal polypeptide
(VIP)-expressing interneurons, can dictate the RGL choice
between quiescence and activation. Furthermore, PV+ interneuron
activation restores RGL quiescence after social isolation,
an experience that induces RGL activation and symmetrical 
division*®. Our study identifies a niche cell-signal-receptor trio 
and a local circuitry mechanism that control the activation and 
self-renewal mode of quiescent adult neural stem cells in response 
to neuronal activity and experience. 

Recent genetic lineage-tracing studies have identified nestin- 
expressing RGLs as quiescent neural stem cells (qNSCs) in the adult 
mouse hippocampus*”. In adult nestin-GFP mice", cells expressing 
green fluorescent protein (GFP+ cells) in the subgranular zone (SGZ)
with radial processes expressed GFAP (glial fibrillary acidic protein) 
but rarely MCM2 (minichromosome maintenance type 2), indicating 
quiescence (Supplementary Fig. 1a, b). To assess whether local interneurons
regulate adult qNSCs directly by means of neurotransmitter
release, we examined RGL responses to GABA in slices acutely pre- 
pared from adult nestin-GFP mice by electrophysiology (see 
Methods). GFP+ RGLs recorded under whole-cell voltage-clamp
showed prominent responses to GABA (200 mM) or the GABA_A
receptor (GABA_AR) agonist muscimol (200 mM), which were
abolished by the GABA_AR antagonist bicuculline (BMI; 50 µM;
Supplementary Fig. 1c, d). GABA responses were potentiated by
diazepam (1 µM), which specifically enhances γ2-containing
GABA_AR responses to GABA". Indeed, GFP+ RGLs showed
immunoreactivity to γ2 (Supplementary Fig. 1e). γ2-containing
GABA_ARs are present in non-neuronal cells and can be found both
outside and inside synapses in mature neurons'’. No spontaneous or
evoked synaptic currents in response to field stimulation of the dentate
granule cell layer were detected in GFP+ RGLs (n = 25 cells;
Supplementary Fig. 1f, g). Instead, tonic GABA responses were
recorded (n = 18 cells; Fig. 1 and Supplementary Fig. 1g, h), suggesting
GABA spill-over from nearby synapses". To exclude the possibility of
synaptic inputs with low release probabilities, we applied hypertonic
solution to enhance presynaptic release”. Increased GABA tonic responses,
but not synaptic currents, were observed (Supplementary
Fig. 1h). Inhibition of the GABA reuptake transporter GAT1 with
NO-711 (10 µM) also increased tonic responses (Fig. 1), further supporting
the tonic nature of GABAergic responses in RGLs.

We next explored pharmacological properties of tonic GABA 
responses in RGLs’*. Consistent with the γ2 involvement, diazepam
(1 µM) significantly increased tonic responses, whereas the benzodiazepine
antagonist flumazenil (10 µM) decreased them (Fig. 1). The
α5-selective benzodiazepine agonist midazolam (10 µM), or the
β3-selective positive allosteric modulator etomidate (ETMD; 100 nM),
increased tonic GABA responses, whereas the α5-selective inverse
agonist L-655708 (50 µM) decreased this response (Fig. 1).
Together, these results suggest that α5β3γ2 GABA_ARs are present in
adult dentate RGLs to mediate tonic responses to GABA.

To examine the functional role of GABA in regulating adult dentate 
RGLs in vivo, we assessed 5-ethynyl-2’-deoxyuridine (EdU) incorporation
and MCM2 expression by RGLs after treatment with diazepam
(Supplementary Fig. 2a). We identified RGLs as SGZ cells with nestin+
radial processes (Fig. 2a). Stereological quantification showed that
treatment with diazepam led to a 45% decrease in the number of
EdU+ RGLs compared with vehicle treatment (Fig. 2b). The number

Figure 1 | Tonic activation of adult quiescent neural stem cells by GABA by
means of α5β3γ2 GABA_ARs. a, Sample traces of whole-cell voltage-clamp
recording from GFP+ RGLs treated with diazepam (1 µM), flumazenil
(10 µM), midazolam (10 µM), ETMD (100 nM) or L-655708 (50 µM), followed
by BMI (100 µM) to obtain a baseline for normalizing tonic responses for each
cell. b, Summary of normalized amplitude of tonic response. Values are means
and s.e.m. (n = 4 or 5 cells; all significantly different from the basal condition;
P < 0.05; Student’s t-test).

Figure 2 | Cell-autonomous role of γ2-containing GABA_ARs in maintaining
adult neural stem cell quiescence. a, b, Diazepam promotes quiescence of
nestin+ RGLs in the adult dentate gyrus. a, Sample confocal images of
immunostaining of nestin, MCM2, EdU and 4’,6-diamidino-2-phenylindole
(DAPI). Arrows indicate nestin+MCM2+ or nestin+EdU+ RGLs. Scale bars,
50 µm (left) and 10 µm (last column). b, Summaries of stereological
quantification of RGL EdU incorporation and MCM2 expression. Values are
means and s.e.m. (n = 4 animals; asterisk, P < 0.01; Student’s t-test). c-e, γ2
deletion in individual RGLs leads to their activation. c, Sample confocal images
of immunostaining. Scale bars, 10 µm. d, e, Summaries of percentages of RGL
clones that were activated (d) and those treated with vehicle or diazepam at
7 days after induction (e) for control (cntl) and cKO mice. Values are means
and s.e.m. (n = 4-8 animals; asterisk, P < 0.01; n.s., P > 0.1; Student’s t-test).


of MCM2+nestin+ RGLs and the percentage of RGLs that were
MCM2+ were also significantly decreased (Fig. 2b). Thus, systemic
enhancement of γ2-mediated GABA signalling promotes adult dentate
RGL quiescence at the population level. 

To examine a cell-autonomous role of γ2 in RGLs, we generated
nestin-CreER^T2+/-;Z/EG^+/-;γ2^f/f (cKO) mice and
nestin-CreER^T2+/-;Z/EG^+/-;γ2^+/+ (control) mice and used a low dose of
tamoxifen for sparse induction to perform clonal analysis of adult
RGLs? (Supplementary Fig. 2b-d). Immunohistology and electrophysiology
indicated highly efficient, but not complete, γ2 deletion in
GFP+ RGLs (Supplementary Fig. 2e, f). In cKO mice, the percentage
of RGL clones that were activated increased markedly compared with
control mice at 2 and 7 days after induction (Fig. 2c, d). Treatment with
diazepam decreased the percentage of activated RGL clones in control
mice at 7 days after induction, but had no effect in cKO mice (Fig. 2e
and Supplementary Fig. 2g). These results showed a direct role of
GABA in maintaining adult NSC quiescence through γ2 signalling.

We next examined the fate choice of activated RGLs. There was a 
marked increase in pairs of closely associated GFP+ RGLs at 2 days
after induction in adult cKO mice compared with controls, indicating 
increased RGL symmetrical self-renewal (Fig. 3a, b). Detailed analysis 
at 7 days after induction showed increased symmetrical and astroglio- 
genic asymmetrical RGL division in cKO mice (Fig. 3c). Conversely, 
treatment with diazepam decreased RGL symmetrical division and 
astrogliogenic asymmetric division in control mice, but had no effect 
in cKO mice (Fig. 3d). In support of the short-term lineage-tracing
results, analysis of clonal composition at 30 days after induction
showed decreased percentages of quiescent clones and an increased 
percentage of clones with multiple RGLs in cKO mice (Fig. 3e, f and 
Supplementary Fig. 3). Consistent with a role of GABA signalling in 
promoting new neuron survival’*, percentages of neurogenic clones

Figure 3 | Clonal analysis of RGL fate choice after conditional γ2 deletion in
individual RGLs in the adult dentate gyrus. a-d, Short-term effect of γ2
deletion on the activation and fate choice of adult dentate RGLs. a, Sample
confocal images of immunostaining for a GFP+ clone indicating symmetrical
division at 7 days after induction. Scale bars, 10 µm. b-d, Summaries of
percentages of clones indicating symmetrical divisions at 2 and 7 days after
induction (b), and percentages of different types of RGL clones (c) and those
treated with vehicle or diazepam (d) at 7 days after induction: R + R (two
RGLs), R + intermediate progenitor cell (IPC; one RGL and one GFAP− IPC)
and R + A (one RGL and one GFAP+ bushy astrocyte). Values are means and
s.e.m. (n = 4-8 animals; asterisk, P < 0.05; n.s., P > 0.1; Student’s t-test).
e, f, Long-term effect (at 30 days after induction) of γ2 deletion on the
composition of GFP+ clones in the adult dentate gyrus. e, Sample confocal
images of immunostaining for a clone consisting of two GFAP+ cells with
radial processes. Scale bars, 10 µm. f, Summary of percentages of different clone
types among all GFP+ clones: R, RGL; N, IPC or neuron; A, astrocyte. Values
are means and s.e.m. (n = 4-8 animals; asterisk, P < 0.05; two asterisks,
P < 0.01; Student’s t-test).


and multilineage clones were decreased significantly (Fig. 3f and 
Supplementary Fig. 3e). In contrast, clones without any RGLs were 
increased in cKO mice (Fig. 3f), suggesting increased RGL depletion 
after γ2 deletion. Together, these gain-of-function and loss-of-function
analyses identified GABA as a niche signal to maintain adult NSC
quiescence and inhibit symmetrical self-renewal and astrocyte fate
choice through γ2-containing GABA_ARs under basal physiological
conditions. 

We next sought to identify GABA-releasing niche cells among 
multiple interneuron subtypes in the adult dentate gyrus'>’®. 
Immunohistological analysis of adult nestin-GFP mice showed a close 
association between GFP+ RGLs and GAD67+ terminals from PV+


6 SEPTEMBER 2012 | VOL 489 | NATURE | 151 


©2012 Macmillan Publishers Limited. All rights reserved 


LETTER 


interneurons (Fig. 4a and Supplementary Movie 1). To determine 
whether PV+ interneurons interact functionally with RGLs, we took
an optogenetic approach and used double-floxed (DIO) adeno-
associated virus (AAV) to express channelrhodopsin-2 (ChR2) or
halorhodopsin (eNpHR3.0) specifically in PV+ interneurons, using

Figure 4 | Regulation of quiescence and activation state of neural stem cells
by PV+, but not SST+ or VIP+ interneuron activity, in the adult dentate
gyrus. a, Sample confocal images of GFP and immunostaining of PV and
GAD67 (see Supplementary Movie 1). Scale bars, 5 µm. b, Sample confocal
image and schematic diagram of electrophysiological recording. Scale bar,
10 µm. c, Sample whole-cell voltage-clamp recording traces of responses after
light stimulation of ChR2+PV+ interneurons from a mature granule cell
(mGC; 1 Hz) and a GFP+ RGL (8 Hz) in acute slices, and after treatment with
BMI (50 µM) or vigabatrin (VGA; 100 µM). d-f, Regulation of RGL activation
in the adult dentate gyrus by local interneuron activity. Shown are summaries of
stereological quantification of RGL EdU incorporation and MCM2 expression
after in vivo activation (ChR2) or suppression (NpHR) of PV+ (d), SST+ (e) or
VIP+ (f) interneurons or sham treatment (cntl; see Supplementary Figs 5a and
6e for experimental procedures). Values are means and s.e.m. (n = 3 or 4
animals; asterisk, P < 0.01; Student’s t-test).

adult PV-Cre mice'’ (Supplementary Fig. 4a). Immunostaining and
electrophysiology confirmed the specificity and efficacy of AAV-
mediated opsin expression in controlling the firing of dentate PV+
interneurons (Supplementary Fig. 4b-e). In acute slices from
PV-Cre^+/-;nestin-GFP^+/- mice, photoactivation of PV+ interneurons
induced synaptic responses in mature dentate granule cells and tonic
responses in GFP+ RGLs to GABA (Fig. 4b, c). Furthermore, a
decrease in GABA turnover with the GABA transaminase inhibitor
vigabatrin (100 µM) drastically increased GFP+ RGL responses to
PV+ interneuron activation (Fig. 4c). Together, these results indicate
that adult RGLs respond tonically to GABA released from local PV+
interneurons. 

To assess the functional impact of PV+ interneuron activity on RGL
behaviour, we photoactivated or suppressed PV+ interneurons in the
dentate gyrus of adult PV-Cre mice for 5 days (Supplementary Fig. 5a). In
comparison with sham treatment without light stimulation, EdU incorporation
and MCM2 expression by RGLs were significantly decreased
after activation of PV+ interneurons expressing ChR2 tagged with yellow
fluorescent protein (ChR2-YFP), resulting in a 53% decrease in RGL
activation at the population level (Fig. 4d and Supplementary Fig. 5b).
Conversely, suppression of PV+ interneurons expressing eNpHR-
YFP led to a 95% increase in RGL activation (Fig. 4d). These results
identified PV+ interneurons as a critical niche component and showed
that PV+ interneuron activity can dictate the RGL choice between
quiescence and activation in the adult dentate gyrus. 

Do other subtypes of local interneurons also regulate RGL behaviour 
in vivo? We developed similar optogenetic strategies to manipulate 
somatostatin-expressing (SST+) or vasoactive intestinal polypeptide-
expressing (VIP+) interneurons'® (Supplementary Fig. 6a). Both SST+
and VIP+ interneurons showed elaborated processes in the SGZ and
hilus region (Supplementary Fig. 6c, d and Supplementary Movie 2),
and our procedure labelled greater numbers of SST+ and VIP+ interneurons
than PV+ interneurons in the adult dentate gyrus (Supplementary
Fig. 6b). Electrophysiological recording of GFP+ RGLs
did not detect any tonic or synaptic responses after light-induced
activation of SST+ or VIP+ interneurons in acute slices (Supplementary
Fig. 6c, d). Functionally, photoactivated or suppressed dentate
SST+ or VIP+ interneurons had no effect on EdU incorporation and
MCM2 expression by RGLs (Fig. 4e, f and Supplementary Fig. 6e). 
Thus, coupling of neuronal circuit activity to RGL behaviour seems 
to be distinctive of PV* interneurons rather than occurring broadly 
across different local interneuron subtypes. 

Finally, we assessed whether GABA also serves as a niche signal to 
mediate experience-dependent regulation of RGLs. We subjected mice 
to a social isolation regime, which decreases neuronal activity in the 
adult dentate gyrus’* and was recently shown to promote RGL expan- 
sion’. Clonal analysis at 7 days after induction showed that, in contrast 
with group housing, social isolation led to a significant increase in 
GFP+ RGL activation and symmetrical and astrogenic division, in a
similar manner to γ2 deletion in RGLs (Fig. 5a, b and Supplementary
Fig. 7a). γ2-deficient RGLs showed no additional activation or fate
alteration after social isolation (Fig. 5b). At the population level,
EdU incorporation and MCM2 expression by RGLs were increased
significantly after social isolation (Fig. 5c and Supplementary Fig. 7b, c).
PV+ interneuron activation abolished the increase in RGL activation
induced by social isolation (Fig. 5c). Thus, dentate PV+ interneurons
also mediate experience-dependent regulation of adult qNSCs through
GABA-γ2 signalling.

Precise control of somatic stem cell activity is essential for the long- 
term maintenance of tissue homeostasis and needs to be closely linked 
to tissue demands at any given time. Our study of adult RGLs at both 
clonal and population levels identified a previously unknown niche 
mechanism that regulates both adult qNSC activation and self-renewal 
mode in response to neuronal activity and experience (Supplementary 
Fig. 8). GABA has been shown to decrease the proliferation of other 
stem cells and progenitors in vitro, including mouse embryonic stem

Figure 5 | Contribution of GABA signalling from PV+ interneurons to
experience-dependent regulation of adult quiescent neural stem cells.
a, b, Clonal analysis of RGL fate choice after social isolation. a, Sample confocal
images of immunostaining for an activated clone with two RGLs at 7 days after
induction after social isolation (see Supplementary Fig. 7 for experimental
procedure). Scale bars, 10 µm. b, Summary of different types of clone at 7 days
after induction. Values are means and s.e.m. (n = 4-8 animals; asterisk,
P < 0.05; Student’s t-test). c, Summaries of stereological quantification of RGL
EdU incorporation and MCM2 expression. Values are means and s.e.m. (n = 4
animals; asterisk, P < 0.05; n.s., P > 0.1; Student’s t-test).


cells, by means of GABA_ARs, the phosphatidylinositol-3-OH kinase
(PI(3)K)-related kinase family and the histone variant H2AX’?”*. 
PTEN deletion in individual RGLs also leads to activation and 
symmetrical self-renewal in the adult dentate gyrus’, suggesting a con- 
served mechanism regulating the proliferation of various stem cells 
through the GABA_AR and PI(3)K/PTEN pathway.

Our optogenetic approach identified PV+ interneurons as a critical
and unique niche component among different interneuron subtypes
that couples neuronal circuit activity to NSC regulation in vivo both
under physiological conditions and in response to specific experience.
PV+ interneurons are abundant in the hippocampus and have been
implicated in higher brain function and cognitive dysfunction”’. In the
adult dentate gyrus, PV+ interneurons receive excitatory inputs from
dentate granule cells and, to a smaller extent, from entorhinal cortical
inputs (Supplementary Fig. 8a). We reconstructed one PV+ interneuron
in the adult PV-Cre^+/-;nestin-GFP^+/- mice and estimated
that it covered more than 200 GFP+ RGLs in the dentate gyrus
(Supplementary Movie 3). A characteristic feature of PV+ interneurons
is the formation of ensembles coupled by both electrical (through
gap junctions) and chemical connections (through reciprocal innervations)!°.
Thus, PV+ interneurons are well suited to couple local circuit
activity to the regulation of a large number of adult NSCs in the
hippocampus as an adaptive mechanism, increasing qNSC activation
when local circuitry activity levels are low, while keeping NSCs in
quiescence when activity levels are high (Supplementary Fig. 8b).
Given that both the number and properties of hippocampal PV+
interneurons are regulated by physiological and pathological conditions,
such as ageing, Alzheimer’s disease, epilepsy, chronic stress,
schizophrenia and other severe psychiatric illness*’*°, our findings 
have broad implications. 


METHODS SUMMARY 


Wild-type (C57BL/6), nestin-GFP'°, PV-Cre’’, SST-Cre'®, VIP-Cre'® and
nestin-CreER^T2+/-;Z/EG^+/-;γ2^f/f (ref. 27) mice were used in the present study. Cre-dependent
recombinant AAV’’ was used for interneuron subtype-specific expression of 
opsins in the adult dentate gyrus. Electrophysiological recordings and analysis 
were performed as described previously. Immunohistochemistry, confocal 
imaging and processing were performed as described previously’. Stereological 
quantification was assessed as described previously”’. All analyses were performed 
by investigators blind to experimental conditions. All animal procedures were 
performed in accordance with institutional guidelines. 


Full Methods and any associated references are available in the online version of 
the paper.
METHODS


Animals, housing, administration of tamoxifen, EdU and AAV, and optogenetic
manipulations. The following genetically modified mice and crosses
between them were used for electrophysiological analysis: nestin-GFP'°
(C57BL/6 background), PV-Cre'’ (JAX laboratory; stock number 008069; stock
name B6;129P2-Pvalb^tm1(cre)Arbr/J), SST-Cre!® (JAX laboratory; stock number
013044; stock name Sst^tm2.1(cre)Zjh/J) and VIP-Cre!® (JAX laboratory; stock number
010908; stock name Vip^tm1(cre)Zjh/J). The following mice were used for neurogenesis
analysis: wild-type (C57BL/6), nestin-CreER^T2+/-;Z/EG^+/-;γ2^floxed/floxed (ref. 27,
C57BL/6) and nestin-CreER^T2+/-;Z/EG^+/- (C57BL/6), PV-Cre (B6;129), SST-
Cre (B6;129), and VIP-Cre (B6;129). Animals were housed in a standard 14 h
light/10 h dark cycle. Socially isolated animals were individually housed immediately
after weaning for at least 6 weeks before injection with tamoxifen or EdU, and
had free access to food and water*. A single dose of tamoxifen (62 mg kg^-1) was
injected intraperitoneally into 6-10-week-old mice as described previously’.

For optogenetic manipulations, Cre-dependent recombinant AAV vectors were 
used based on a DNA cassette carrying two pairs of incompatible loxP sites with 
the opsin genes (ChR2-H134-mCherry, ChR2-H134-YFP or eNpHR3.0-YFP)
inserted between lox sites in the reverse orientation as described previously’” 
(Supplementary Fig. 4a). The recombinant AAV vectors were serotyped with 
AAV2/9 for ChR2 (packaged at the UPenn Vector Core) and with AAV9 for 
eNpHR3.0 (packaged at University of North Carolina Vector Core). The following 
final viral concentrations were used for AAV viruses (X 10! particles ml‘): 7.4 
(ChR2-YFP), 36 (ChR2-mCherry) and 8 (eNpHR3.0-YFP), respectively. AAV 
was delivered stereotactically into the dentate gyrus with the following coordinates 
(in mm): anterioposterior = —2 from bregma; lateral = + 1.5; ventral = 2.2. Fibre 
optic cannulae (Doric Lenses, Inc.) were implanted at the same injection sites 
immediately after AAV injection with a dorsal-ventral depth of 1.6 mm from
the skull. Animals were then allowed to recover for at least 4 weeks after surgery. 
For analysis of RGL activation at the population level after optogenetic manipula- 
tions, littermates of animals were used and an in vivo light regime was administered 
8h per day for five consecutive days (Supplementary Figs 5a, 6e and 7b). For 
ChR2-YFP stimulation, flashes of blue light (472 nm; 5 ms at 8 Hz) through the 
DPSSL laser system (Laser Century Co. Ltd) were delivered in vivo every 5 min for 
30s per trial. For eNpHR-YFP stimulation, continuous yellow light (593 nm) was 
delivered in vivo. On the fifth day, animals were injected with EdU (41.1 mg per kg 
body weight) six times with an interval of 2h. Animals were killed 2 h after the last 
EdU injection and were processed for immunostaining as described previously’. 
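
The ChR2 light regime above implies a low overall light exposure, which the following Python sketch works out. It is a back-of-envelope illustration, assuming that "every 5 min" means one 30 s trial starts every 5 min (onset to onset) throughout the 8 h daily session; the function and key names are hypothetical.

```python
def chr2_light_regime(pulse_ms=5.0, freq_hz=8.0, trial_s=30.0,
                      trial_period_min=5.0, hours_per_day=8.0, days=5):
    """Summarize pulse counts and light-on time for the in vivo ChR2 regime."""
    pulses_per_trial = freq_hz * trial_s                      # 8 Hz for 30 s
    trials_per_day = hours_per_day * 60.0 / trial_period_min  # one trial per 5 min
    pulses_per_day = pulses_per_trial * trials_per_day
    light_on_s_per_day = pulses_per_day * pulse_ms / 1000.0
    return {
        "pulses_per_trial": pulses_per_trial,
        "trials_per_day": trials_per_day,
        "pulses_per_day": pulses_per_day,
        "light_on_s_per_day": light_on_s_per_day,
        "light_on_s_total": light_on_s_per_day * days,
        # Fraction of time the light is on within a 30 s trial.
        "duty_cycle_within_trial": freq_hz * pulse_ms / 1000.0,
    }


if __name__ == "__main__":
    for key, value in chr2_light_regime().items():
        print(f"{key}: {value:g}")
```

Under these assumptions each trial delivers 240 pulses, there are 96 trials per day, and the light is on for roughly 115 s per 8 h session.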

All animal procedures were performed in accordance with institutional 
guidelines. 

Electrophysiology. Mice were anaesthetized and processed for slice preparation 
as described previously”*. In brief, brains were quickly removed into the ice-cold
solution (in mM: 110 choline chloride, 2.5 KCl, 1.3 KH2PO4, 25.0 NaHCO3, 0.5
CaCl2, 7 MgSO4, 20 dextrose, 1.3 sodium L-ascorbate, 0.6 sodium pyruvate, 5.0
kynurenic acid). Slices 300 µm thick were sectioned with a vibratome (Leica
VT1000S) and transferred to a chamber containing the external solution
(in mM: 125.0 NaCl, 2.5 KCl, 1.3 KH2PO4, 1.3 MgSO4, 25.0 NaHCO3, 2 CaCl2,
1.3 sodium L-ascorbate, 0.6 sodium pyruvate, 10 dextrose, pH 7.4, 320 mOsM),
bubbled with 95% O2/5% CO2. Electrophysiological recordings were obtained at
32-34 °C. GFP+ RGLs located within the SGZ in adult nestin-GFP^+/- mice were
revealed by differential interference contrast and fluorescence microscopy. A
whole-cell patch-clamp configuration was employed in the voltage-clamp mode
(Vm = −65 mV) or current-clamp mode. Microelectrodes (4-6 MΩ) were pulled
from borosilicate glass capillaries and filled with the internal solution containing
(in mM)** 135 CsCl gluconate, 15 KCl, 4 MgCl2, 0.1 EGTA, 10.0 HEPES, 4 ATP
(magnesium salt), 0.3 GTP (sodium salt) and 7 phosphocreatine (pH 7.4,
300 mOsM). All RGL recordings were performed in the presence of kynurenic
acid (5 mM). Data were collected with an Axon 200B amplifier and acquired with a
DigiData 1322A (Axon Instruments) at 10 kHz. For measuring GABA-induced
responses from GFP+ RGLs, focal pressure ejection of 200 mM GABA or muscimol
through a puffer pipette controlled by a Picospritzer (2 s puff at 3-5 lb in^-2) was used
to activate GABA_ARs under the whole-cell voltage-clamp. A bipolar electrode
(World Precision Instruments) was used to stimulate (0.1 ms duration) the dentate
granule cell layer. Low-frequency stimuli (0.1 Hz) and theta bursts (8 Hz with a train
of 100 stimuli) were delivered. The stimulus intensity (50 µA) was maintained for all
experiments. The following pharmacological agents were used: diazepam (1 µM),
NO-711 (10 µM), flumazenil (10 µM), midazolam (10 µM), ETMD (10 µM),
L-655708 (50 µM) and vigabatrin (100 µM). All drugs were purchased from
Sigma except bicuculline (50 or 100 µM; Tocris).

RGL recordings under optogenetic manipulation in acute brain slices were
performed at least 4 weeks after injection with AAV. To stimulate ChR2 in labelled
interneurons, light flashes (5 ms at 1, 8 or 100 Hz) generated by a Lambda DG-4
plus high-speed optical switch with a 300 W Xenon lamp and a 472 nm filter set
(Chroma) were delivered to coronal sections through a ×40 objective lens (Carl
Zeiss). To stimulate eNpHR in labelled interneurons, continuous yellow light
generated by a DG-4 plus system with a 593 nm filter set was delivered to coronal
sections across a full high-power (×40) field.
Immunohistochemistry, confocal imaging, processing and quantification. For 
immunostaining with anti-nestin and anti-MCM2, an antigen retrieval protocol 
was performed by microwaving sections in boiled citric buffer for 7 min as 
described previously. For γ2 immunostaining, a weak fixation protocol using live 
tissues was adopted as described previously. For characterization of different 
interneuron subtypes, the following antibodies were used: anti-PV (Swant; mouse 
or rabbit; 1:500 dilution), anti-GAD-67 (Millipore; mouse or rabbit; 1:500 
dilution), anti-SST (Millipore; rat; 1:200 dilution) and anti-VIP (Immunostar; 
rabbit; 1:200 dilution). For clonal analysis, coronal brain sections (40 µm) through 
the entire dentate gyrus were collected in a serial order, and immunostaining was 
performed with the following primary antibodies as described previously”: anti- 
GFP (Rockland; goat; 1:500 dilution), anti-nestin (Aves; chick; 1:500 dilution), 
anti-MCM2 (BD; mouse; 1:500 dilution), anti-GFAP (Millipore; mouse or rabbit; 
1:1,000 dilution) and anti-PSA-NCAM (Millipore, mouse IgM; 1:500 dilution). 
For quantification of GFP clones at 2 and 7 days after induction, a single GFP+ 
RGL was scored as a quiescent clone. Two or more nuclei in a GFP+ RGL clone 
were scored as activation. Clonal analysis at 30 days after induction was conducted 
exactly as described previously. 

For experiments with diazepam (5 mg kg⁻¹ body weight; once daily for 5 days), 
coronal brain sections (40 µm) through the entire dentate gyrus were collected in a 
serial order. For optogenetic manipulations, sections within a distance of 1.0 mm 
anterior and 1.0 mm posterior to injection sites were used for quantification, given 
the estimated light spread in vivo. Immunostaining was performed on every sixth 
section as described previously’. EdU labelling was performed with a Click-iT EdU 
Alexa Fluor imaging kit (Invitrogen). Images were acquired on a Zeiss LSM 710 
confocal system (Carl Zeiss) with a ×40 objective lens using a multitrack configuration. 
Stereological quantification of cells positive for various molecular markers 
was assessed in the dentate gyrus with a modified optical fractionator technique. 
For quantification of EdU+ or MCM2+ RGLs, an inverted ‘Y’ shape from 
anti-nestin staining superimposed on an EdU+ or MCM2+ nucleus was scored 
as double positive for nestin and EdU or MCM2. All analyses were performed by 
investigators blind to experimental conditions. Statistical analysis was performed 
with Student’s t-test. 
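
The arithmetic behind a fractionator-style estimate can be sketched as follows. This is a minimal illustration of the standard optical fractionator formula (total = counted cells divided by the product of the sampling fractions), not the modified procedure used in the study, whose details are not given here. Only the section interval of 6 comes from the text (every sixth section). The area and thickness sampling fractions are hypothetical parameters, and the function name is an assumption.

```python
def fractionator_estimate(counts_per_section, section_interval=6,
                          area_sampling_fraction=1.0,
                          thickness_sampling_fraction=1.0):
    """Estimate the total cell number from counts on sampled sections.

    counts_per_section: cells counted on each analysed section.
    section_interval: every n-th section was analysed (6 in the text).
    area_sampling_fraction, thickness_sampling_fraction: fractions of
        section area and thickness actually counted (assumed values).
    """
    if section_interval < 1:
        raise ValueError("section_interval must be at least 1")
    for fraction in (area_sampling_fraction, thickness_sampling_fraction):
        if not 0 < fraction <= 1:
            raise ValueError("sampling fractions must lie in (0, 1]")
    counted = sum(counts_per_section)
    return (counted * section_interval
            / area_sampling_fraction
            / thickness_sampling_fraction)
```

For example, 30 cells counted across three analysed sections, with every sixth section sampled and the whole area and thickness counted, would scale to an estimate of 180 cells.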

For generation of movie files, images were serially reconstructed in Reconstruct 
(J. C. Fiala, NIH), normalized, and deconvolved with Autoquant X2 (Media 
Cybernetics). Images were then segmented in MATLAB (The Mathworks) using 
custom code and imported into Imaris (Bitplane). Surface renderings and movies 
were made using the Surface and Animation functions, respectively, in Imaris 
(Supplementary Movies 1-3). 


30. Schneider Gasser, E. M. et al. Immunofluorescence in brain sections: 
simultaneous detection of presynaptic and postsynaptic proteins in identified 
neurons. Nature Protocols 1, 1887-1897 (2006). 

The identification of somatic activating mutations in JAK2 (refs 1-4) 
and in the thrombopoietin receptor gene (MPL; ref. 5) in most patients 
with myeloproliferative neoplasm (MPN) led to the clinical development 
of JAK2 kinase inhibitors (refs 6, 7). JAK2 inhibitor therapy improves 
MPN-associated splenomegaly and systemic symptoms but does not 
significantly decrease or eliminate the MPN clone in most patients 
with MPN. We therefore sought to characterize mechanisms by which 
MPN cells persist despite chronic inhibition of JAK2. Here we show 
that JAK2 inhibitor persistence is associated with reactivation of 
JAK-STAT signalling and with heterodimerization between activated 
JAK2 and JAK1 or TYK2, consistent with activation of JAK2 in trans 
by other JAK kinases. Further, this phenomenon is reversible: JAK2 
inhibitor withdrawal is associated with resensitization to JAK2 kinase 
inhibitors and with reversible changes in JAK2 expression. We saw 
increased JAK2 heterodimerization and sustained JAK2 activation 
in cell lines, in murine models and in patients treated with JAK2 
inhibitors. RNA interference and pharmacological studies show that 
JAK2-inhibitor-persistent cells remain dependent on JAK2 protein 
expression. Consequently, therapies that result in JAK2 degradation 
retain efficacy in persistent cells and may provide additional benefit 
to patients with JAK2-dependent malignancies treated with JAK2 
inhibitors. 

The development of targeted therapies has improved outcomes for 
patients with kinase-mutant malignancies*''; however, acquired 
resistance due to mutations in the target kinase’*"* or in other 
pathways that render cancer cells insensitive to kinase inhibitor 
therapy'*’* remain important clinical concerns. Although JAK inhi- 
bitors are now being used to treat patients with MPN, so far JAK 
inhibitor treatment has not been associated with significant decreases 
in disease burden in most patients with MPN*’. To understand 
mechanisms by which MPN cells survive despite chronic JAK kinase 
inhibition, we performed saturation mutagenesis’’ and next-generation 
sequencing in cells exposed to two structurally different JAK2 inhibi- 
tors, INCB18424 and JAK Inhibitor I. We identified second-site muta- 
tions in less than 30-50% of cells exposed to JAK2 inhibitors 
(Supplementary Table 1). Full-length resequencing of clones prolif- 
erating in the presence of INCB18424 or JAK Inhibitor I confirmed 
the absence of second-site JAK2 mutations in most surviving clones, 
and we did not identify second-site JAK2 mutations in granulocytes 
from five patients with MPN who had been treated with INCB18424. By 
contrast, control experiments with mutagenized BCR-ABL cells exposed 
to imatinib identified more than 20 known, clinically relevant, imatinib 
resistance alleles’®*° (data not shown). 

These data and clinical experience suggest that the failure of JAK2 
inhibitors to decrease disease burden is not due to acquired drug 
resistance but rather due to persistent growth and signalling in the 
setting of chronic JAK2 kinase inhibition. We therefore investigated 
the basis by which JAK2-dependent cells persist despite chronic JAK2 
kinase inhibition. We cultured SET-2/UKE-1 (JAK2V617F-positive 
leukaemia) cells and Ba/F3 cells expressing JAK2V617F (EporVF) or 
MPLW515L (WL) cells with INCB18424 or JAK Inhibitor I for 
4-6 weeks. In each case we found that JAK2/MPL-mutant cells could 
survive and proliferate at inhibitor concentrations sufficient to prevent 
the growth of parental cells (Fig. 1a, b and Supplementary Figs 1a 
and 2a). JAK2-inhibitor-persistent (JAK2Per) cells were resistant to 
INCB18424-induced apoptosis (Supplementary Fig. 3). JAK2 
resequencing confirmed the absence of second-site mutations in all 
JAK2Per cell lines. JAK2Per cells were also insensitive to structurally 
divergent JAK inhibitors, including TG101348, a JAK2-selective 
inhibitor in late-stage clinical trials (Fig. 1c and Supplementary 
Figs 1b, c, 2b and 4). These data indicate that JAK2Per cells are 
insensitive to different JAK inhibitors regardless of previous exposure 
to that inhibitor. 

Figure 1 | Generation of JAK2-inhibitor-persistent cells. a, b, Proliferation 
of naive and persistent SET-2 (a) and WL (b) cells with JAK2 inhibitors. Data 
(means ± s.d.) are from wells plated in triplicate and are representative of three 
independent experiments. c, IC50 values of SET-2 INPer and WL INPer cells 
exposed to INCB18424, TG101348 and JAK Inhibitor I (JAK Ib I). 

These data are consistent either with the selection of a subpopula- 
tion of pre-existing, persistent cells, as previously posited for epidermal 
growth factor receptor (EGFR) inhibitor-insensitive “drug-tolerant 
persisters”’, or with the acquisition of persistence by naive, inhibitor- 
sensitive cells. To distinguish between these possibilities, we derived 
single-cell clones of inhibitor-naive JAK2/MPL mutant cell lines. 
Each clonally derived naive cell line was sensitive to JAK inhibitors 
and retained the capacity to become persistent over time to different 
JAK inhibitors (Supplementary Fig. 5 and data not shown). These 
data depict a general capacity for persistence in the absence of clonal 
selection. 

Next, we assessed signalling downstream of JAK2 in JAK2Per cells. 
We observed dose-dependent inhibition of downstream signalling 
in naive cells treated with INCB18424 or JAK Inhibitor I, but not 
in INCB18424Per (Fig. 2a and Supplementary Fig. 6a) or JAK 
Inhibitor I-persistent cells (Supplementary Fig. 6b). Similarly, ex vivo treatment 
of granulocytes from patients chronically treated with 
INCB18424 demonstrated sustained downstream signalling at 
inhibitor concentrations that inhibited signalling in naive MPN 
patient samples (Fig. 2b). We then examined whether persistence 
was associated with constitutive JAK2 activation. We observed persistent 
phosphorylation of JAK2 in JAK2Per cells (Supplementary Figs 2c 
and 6c). Further, gene expression analysis showed that the expression 
of known JAK-STAT target genes was maintained in JAKPer cells, 
whereas these genes were suppressed with acute treatment of 
inhibitor-naive parental cells (Supplementary Fig. 7). 

Given that JAK inhibitors should inhibit JAK2 autophosphoryla- 
tion, we reasoned that other kinases might associate with and 
phosphorylate JAK2 in persistent cells. Although EpoR and MPL 
predominantly signal through JAK2 (ref. 22), previous studies have 
shown that many cytokine receptors signal through JAK kinase 
heterodimers”*. We therefore assessed the activation status of JAK1, 
JAK3 and TYK2 in naive and persistent SET-2 and WL cells. We 
observed increased phosphorylation of JAK1 in JAK2Per cells in comparison 
with parental cells, whereas TYK2 was constitutively phosphorylated 
in both parental and JAK2Per cells (Fig. 2c). Accordingly, 
immunoprecipitation studies demonstrated that JAK1 and TYK2 
associated with phosphoJAK2 in JAK2Per SET-2, WL (Fig. 2d) and 
UKE-1 (Supplementary Fig. 2d) cells, but not in the respective parental 
cells. We saw a similar association between phosphoJAK2 and JAK1 or 
TYK2 in INCB18424-treated patient samples but not in inhibitor- 
naive patient samples (Fig. 2e and Supplementary Table 2). 

Next, we examined whether the JAKPer cells were insensitive to JAK 
inhibitors. In vitro kinase assays revealed that the JAKPer heterodimer 
complex could phosphorylate myelin basic protein at concentrations 
of INCB18424 sufficient to inhibit JAK2 kinase activity in naive SET-2 
cells (Supplementary Fig. 8). These data suggest that the heterodimer 
complex in JAKPer cells retains kinase activity that is relatively 
insensitive to JAK inhibitors. To determine whether JAK1-mediated 
phosphorylation of JAK2 was insensitive to INCB18424, we co- 
expressed a constitutively active mutant form of JAK1 (JAK1V658F)* 
with kinase-dead JAK2 (JAK2K882E) in JAK2-deficient 2A cells. We 
observed persistent JAK2 phosphorylation in JAK1V658F/JAK2K882E 
2A cells exposed to INCB18424 at concentrations sufficient to inhibit 
JAK2 autophosphorylation (Supplementary Fig. 9). 

We then investigated whether JAK2 inhibitor persistence was 
reversible. We removed INCB18424 or JAK Inhibitor I for 2-4 weeks; 
this led to JAK inhibitor resensitization (Fig. 3a and Supplementary 
Fig. 10a, b). Resensitized (JAK2Resens) cells were sensitive to all three 
JAK inhibitors, suggesting that patients with MPN may respond to 
retreatment or to a different JAK2 inhibitor after a brief withdrawal of 
treatment. JAK1 or TYK2 association with phosphoJAK2 was lost in 
JAK2Resens cells (Fig. 3b and Supplementary Fig. 10c), and activated 
JAK2 levels were lower in JAK2Resens cells (Supplementary Fig. 10d). 

Previous work attributed persistence in EGFR inhibitor-insensitive 
‘drug-tolerant persisters’ to the engagement of alternative survival 
pathways. By contrast, JAKPer cells were characterized by JAK-STAT 
pathway reactivation (Fig. 2). We therefore speculated that changes in 
the epigenetic regulation of JAK2 might contribute to JAK inhibitor 
persistence. JAK2 messenger RNA (Supplementary Fig. 11) and JAK2 
protein (Fig. 3c and Supplementary Figs 2e and 10e) levels were higher 
in JAK2Per cells than in parental cells, and were lower in JAK2Resens 
cells. Chromatin immunoprecipitation sequencing (ChIP-Seq) 
analysis of naive JAK2-mutant SET-2 cells (M.A., O.A.W., B.E.B. 
and R.L.L., unpublished observations) revealed that the JAK2 locus 
is characterized by trimethylation of histone H3 on Lys4 
(H3K4me3), a modification associated with active promoters, and 
by H3K9 trimethylation, a mark more typically associated with 
inactive heterochromatin (Supplementary Fig. 12a and Supplementary 
Table 3). Analysis of the JAK2 locus by ChIP coupled to quantitative 
polymerase chain reaction (ChIP-qPCR) showed a significant 
increase in H3K4me3 and a decrease in H3K9me3 in JAK2Per cells in 
comparison with parental cells (Fig. 3d), which is consistent with a 
change to a more active chromatin state at the JAK2 locus. However, 
global H3K4me3 levels in naive and persistent cells remained 
unchanged, which is consistent with specific effects on H3K4me3 at 
the JAK2 locus in persistent cells (Supplementary Fig. 12b). 

Figure 2 | Inhibitor-persistent cells and granulocytes from INCB18424-treated 
patients show continual JAK-STAT signalling and JAK2 activation 
through transphosphorylation by JAK1 and TYK2. a, SET-2 and SET-2 
INPer cells were washed and incubated for 4 h with increasing concentrations of 
INCB18424 and western blotted. MAPK, mitogen-activated protein kinase. 
b, Granulocytes from naive and INCB18424-treated patients (Pt.) were 
incubated ex vivo for 6 h with dimethylsulphoxide (DMSO) or 150 nM 
INCB18424 and western blotted. c, Increased phosphorylation of JAK1 in 
persistent cells and constitutive TYK2 phosphorylation in both naive and 
persistent cells. d, Increased association between phosphoJAK2 and both JAK1 
and TYK2 in SET-2 JAKPer cells and increased association between JAK2 and 
both JAK1 and TYK2 in WL JAKPer cells. e, JAK1/TYK2 association with 
phosphoJAK2 in granulocytes from three INCB18424-treated (IN Tx) patients, 
which is not observed in INCB18424-naive MPN samples. 

Figure 3 | JAK2 inhibitor persistence is reversible and JAK2 levels correlate 
with persistence and resensitization. a, Percentage viability of SET-2 
persistent (INPer) and resensitized (INResens) cells at 0.25 µM JAK Inhibitor I, 
0.25 µM INCB18424 and 2 µM TG101348. Data (means ± s.d.) are from wells 
plated in triplicate and are representative of three independent experiments. 
b, Loss of JAK1/TYK2 association with phosphoJAK2 in SET-2 and WL 
INResens cells. c, Reversible changes in JAK2 levels in INPer cells compared with 
naive and INResens SET-2 and WL cells. d, ChIP-qPCR of the JAK2 locus shows 
increased H3K4me3 and decreased H3K9me3 marks in SET-2 INPer cells. 
e, PhosphoJAK2 and total JAK2 levels are degraded on treatment with 
cycloheximide (CHX; 500 µg ml⁻¹ for 2, 4 and 6 h) in naive and resensitized 
WL cells, but not in INPer cells. f, Higher JAK2 levels in INCB18424-treated 
MPN granulocytes by qRT-PCR compared with those in a small cohort of best 
responders. 

Given that JAK2 protein levels, and particularly phosphoJAK2 levels, 
increased with persistence, we examined whether JAK2 inhibitor persistence 
was also associated with post-transcriptional stabilization of 
total and activated JAK2. We have previously shown that JAK2 levels 
decline rapidly on treatment with cycloheximide in JAK2-mutant 
cells. We noted a time-dependent decrease in phosphoJAK2 and total 
JAK2 levels in naive and resensitized WL/SET-2 cells; however, exposure 
to cycloheximide did not result in a significant decline in JAK2, or 
more notably in phosphoJAK2, in INCB18424Per cells (Fig. 3e and 
Supplementary Fig. 13). These data suggest that chronic treatment with 
inhibitor results in the stabilization of activated JAK2, which, combined 
with increased JAK2 mRNA expression, facilitates the formation of 
heterodimers. 

We then assessed whether this phenomenon was observed in vivo. We 
treated mice engrafted with MPLW515L-mutant murine bone marrow”® 
with vehicle or with INCB18424. Treatment with INCB18424 was asso- 
ciated with decreased splenomegaly; however, the proportion of malig- 
nant cells was not decreased on treatment with JAK inhibitor, in 
concordance with our previous results (Supplementary Fig. 14a)”. 
Treatment with INCB18424 was associated with an increase in JAK2 
mRNA and JAK2 protein expression (Supplementary Fig. 14b), similar 
to that observed in JAK2Per cells. We also observed an increase in JAK2 
granulocyte mRNA levels in INCB18424-treated patients without 
clinical or molecular responses, in contrast with patients with clinical 
or molecular responses to INCB18424 (P = 0.05) (Fig. 3f and Supplementary 
Table 2). Finally, we noted increased JAK2 phosphorylation 
and increased association between JAK1 and JAK2 in haematopoietic 
cells from MPLW515L-mutant mice treated with INCB18424 (Supplementary 
Fig. 14c, d), which is consonant with the expression data. 

We examined whether JAK2Per cells remain JAK2 dependent. 
JAK2 silencing inhibited proliferation (Fig. 4a), JAK2 activation and 
downstream signalling (Fig. 4b) in naive and JAK2Per SET-2 cells, 
which is consistent with a requirement for JAK2 expression in 
JAK2Per cells. These data are consistent with previous studies in 
prolactin receptor cellular systems demonstrating that catalytically 
inactive JAK2 can serve as a scaffold for transactivation and downstream 
signalling. However, this had not previously been implicated 
in JAK-dependent malignancies or in the response to JAK kinase 
inhibitors. Knockdown of JAK1 and TYK2 increased the sensitivity 
of SET-2 INCB18424Per and SET-2 JAK Inhibitor I-persistent cells to 
INCB18424 and JAK Inhibitor I, respectively (Fig. 4c and Supplementary 
Fig. 15a-c), whereas the parental cells remained unaffected by 
JAK1 and TYK2 knockdown (Supplementary Fig. 15d). Further, 
JAK1 and TYK2 knockdown led to decreased downstream signalling 
and decreased JAK2 phosphorylation in the persistent cells (Supplementary 
Fig. 15e, f). 

Figure 4 | Transphosphorylation of JAK2 by JAK1/TYK2 contributes to 
persistence, and persistent cells can be targeted with type II JAK2 inhibitors 
or Hsp90 inhibition. a, SET-2 cells were transfected with non-targeting 
(shScr) or two JAK2 shRNAs (shJAK2-1 and shJAK2-2). Viability after 10 days 
of puromycin selection relative to cell numbers on day 1 is shown. Results are 
from three biological replicates (means ± s.e.m.). b, JAK2 knockdown inhibits 
signalling in puromycin-selected sensitive and persistent SET-2 cells. c, INPer 
SET-2 cells were partly resensitized to INCB18424 after loss of JAK1 or 
JAK1 + TYK2 using siRNA. Data (means ± s.d.) are from wells plated in 
triplicate and are representative of three independent experiments. d, Naive 
and persistent SET-2 cells are inhibited by PU-H71. Data (means ± s.d.) are 
from wells plated in triplicate and are representative of three independent 
experiments. e, PU-H71 degrades JAK2 and inhibits signalling in SET-2 cells. 
Cells were treated with DMSO or 2 µM PU-H71 (SET-2) and 1 µM PU-H71 
(WL) for 16 h. f, Treatment with BBT-594 for 4 h inhibits signalling in naive 
and persistent SET-2 cells. 

We next assessed whether new therapeutic approaches might 
reverse JAK inhibitor persistence. We previously reported that 
Hsp90 inhibitors increase JAK2 degradation in vitro and in vivo. 
JAK2Per and parental cells were equally sensitive to Hsp90 inhibition 
by PU-H71 (Fig. 4d and Supplementary Fig. 16a), and PU-H71 treatment 
led to JAK2 degradation and inhibited signalling in JAK2Per cells 
(Fig. 4e). The currently available type I JAK inhibitors are conformation 
dependent and can only engage activated JAK2 (ref. 28). We 
therefore tested the effects of BBT-594, a type II inhibitor that retains 
the ability to bind inactive JAK2 (ref. 28), in JAK2Per cells. BBT-594 
inhibited the proliferation, JAK activation, and signalling of naive and 
JAKPer cells to a similar extent (Fig. 4f and Supplementary Fig. 16b, c). 

Taken together, our results suggest that kinase inhibitor persistence 
can occur through reversible changes in JAK2 expression and trans- 
phosphorylation (Supplementary Fig. 17). We show that persistent 
JAK2 activation in the setting of exposure to JAK inhibitor allows cells 
to survive without decreasing dependence on JAK2 expression. 
Consequently, treatments that lead to JAK2 degradation (Hsp90 inhi- 
bitors or histone deacetylase inhibitors)”*”° or that retain the ability to 
inhibit JAK2 in persistent cells have the potential to improve thera- 
peutic efficacy in patients with MPN. 


METHODS SUMMARY 

Generation of JAK2-inhibitor-persistent cells. Cells were cultured continuously 
in increasing concentrations of INCB18424 or JAK Inhibitor I for 4-6 weeks. Cells 
were considered resistant when the half-maximal inhibitory concentrations (IC50 
values) of the persistent derivatives were at least double the IC50 of parental cells 
(verified by in vitro inhibitor assays). Persistent cells were cultured continuously in 
the presence of the JAK2 inhibitor. For resensitization experiments, inhibitor was 
withdrawn from the medium and cells were cultured in the absence of the drug for 
2-4 weeks. 
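
The criterion above (an IC50 at least double that of parental cells) reduces to a simple fold-change test, sketched here in Python. The function name and the example values are hypothetical. Only the twofold threshold comes from the text.

```python
def is_persistent(ic50_derivative, ic50_parental, min_fold=2.0):
    """Return True if the derivative IC50 is at least min_fold times
    the parental IC50 (both in the same concentration units)."""
    if ic50_derivative <= 0 or ic50_parental <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_derivative / ic50_parental >= min_fold
```

With illustrative values, a derivative IC50 of 0.5 µM against a parental IC50 of 0.25 µM meets the criterion, whereas 0.4 µM against 0.25 µM does not.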

Knockdown of JAK2 and TYK2 in human cell lines. Short hairpin RNA 
(shRNA) for JAK2 was purchased from the High Throughput Drug Screening 
Facility at Memorial Sloan-Kettering Cancer Center, or was a gift from L. Staudt. 
shRNA against TYK2 was a gift from T. Look. Whenever required, shRNA oligonucleotides 
were cloned into pLKO lentiviral systems. Cell lines were transfected 
with lentivirus, and selected with puromycin. Short interfering RNA (siRNA) 
targeting either JAK1 or TYK2 was purchased from Invitrogen and used in accord- 
ance with the manufacturer’s instructions. 

Murine model and analysis of mice. The MPLW515L murine BMT assay was 
performed as described previously. Sick mice were randomized to receive 
INCB18424 twice daily at 60 and 90 mg kg⁻¹ or vehicle (0.5% methylcellulose) 
by oral gavage. Mice were treated for 28 days or until any one of several criteria for 
killing were met, including moribundity, more than 10% body weight loss, and 
palpable splenomegaly extending across the midline. Animal care was in strict 
compliance with Memorial Sloan-Kettering Cancer Center guidelines. Bone 
marrow and spleen cells were strained and viably frozen in 90% FCS and 10% 
DMSO. 


Full Methods and any associated references are available in the online version of 
the paper. 

METHODS 

Reagents and cell lines. The pan JAK inhibitor, JAK Inhibitor I, was purchased 
from Calbiochem (catalogue no. 420097). The JAK1 and JAK2 specific inhibitor 
INCB18424 was purchased from Chemietek. PU-H71 
(8-(6-iodobenzo[d][1,3]dioxol-5-ylthio)-9-(3-(isopropylamino)propyl)-9H-purine-6-amine) 
was synthesized as reported previously (ref. 31). BBT-594 was a gift from T. Radimerski. Stock 
aliquots (1 mM) were prepared in DMSO and diluted in appropriate medium 
before use. Antibodies used for western blotting and immunoprecipitation 
included phosphorylated and total JAK2, STAT3, mitogen-activated protein 
kinase, AKT and phosphoSTAT5 (Cell Signaling Technologies). Total STAT5 
antibody was purchased from Santa Cruz Biotechnology, and actin from EMD 
Chemicals. JAK1 and TYK2 antibodies were purchased from BD Transduction. 
Pan phosphotyrosine antibody was purchased from Millipore. The generation and 
maintenance of Ba/F3 hMPLW515L and Ba/F3 EpoR-V617F cells have been 
described previously’. The JAK2V617F-positive human leukaemic cell line SET- 
2 was grown in RPMI 1640 medium with 20% heat-inactivated serum, whereas 
UKE-1 (also JAK2V617F-positive) cells were grown in RPMI 1640 with 10% fetal 
calf serum, 10% horse serum and 1 µM hydrocortisone (Sigma; catalogue no. 
H6909). Cycloheximide was purchased from Sigma. 
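
Diluting a 1 mM DMSO stock into medium follows C1 × V1 = C2 × V2. The Python sketch below is an illustration only: the function name is hypothetical, and the worked values combine the 1 mM stock stated above with a 2 µM final concentration and a 200 µl well volume that appear elsewhere in the paper. The actual dilution scheme used is not described.

```python
def stock_volume_ul(stock_um, final_um, final_volume_ul):
    """Volume of stock (µl) needed so that final_volume_ul of medium
    contains the compound at final_um, from C1 * V1 = C2 * V2."""
    if stock_um <= 0 or final_um < 0 or final_volume_ul <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_um > stock_um:
        raise ValueError("final concentration cannot exceed the stock")
    return final_um * final_volume_ul / stock_um


# 1 mM stock = 1000 µM; 2 µM final in a 200 µl well.
volume_for_2um = stock_volume_ul(1000.0, 2.0, 200.0)
```

The resulting volume (0.4 µl per well) is too small to pipette reliably, which is why an intermediate dilution in medium would normally be prepared first.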

Knockdown of JAK1, JAK2 and TYK2 by siRNA or shRNA. siRNA oligonu- 
cleotides against JAK1 and TYK2 were purchased from Invitrogen and used in 
accordance with the manufacturer’s instructions. The two siRNA oligonucleotides 
used for JAK1 knockdown were 5’-GCACAGAAGACGGAGGAAAUGGUAU-3’ 
(JAK1VHS41387) and 5’-GCCUUAAGGAAUAUCUUCCAAAGAA-3’ 
(JAK1VHS41388). The siRNA sequence for TYK2 included a combination of two 
oligonucleotides (5’-UUCUCAUGGACUGUCUUCAGAAUGG-3’ (TYK2VHS41729) 
and 5'-GCAGCAAGUAUGAUGAGCAAGCUUU-3’ (TYK2VHS41246)). Scrambled 
siRNA was purchased from Dharmacon (D-001206-13-20). Cells were transfected 
with scrambled siRNA, siJAK1, siTYK2, or both siJAK1 and siTYK2. Viability 
assays were set up 24h after transfection and harvested after 48h. Cells were 
harvested at 72h after transfection to verify knockdown and assess downstream 
signalling. Persistent cells were cultured in the presence of inhibitor during the 
entire experiment. shRNA oligonucleotides against JAK2 and TYK2 were gifts 
from L. Staudt and T. Look, respectively. shRNA target sequences used for knock- 
down of JAK2 were 5'-CTCTTCGAGTGGATCAAATAA-3’ (shRNA 1) and 
5'-GCAGAATTAGCAAACCTTATA-3’ (shRNA 2). The target sequence for 
shRNA against TYK2 was 5'-CGTGAGCCTAACCATGATCTT-3’. Lentiviral 
particles were generated with the use of standard procedures. Cells were spinfected 
with virus and selected with puromycin. Cell viability was monitored with trypan 
blue (for JAK2 knockdown studies), and cells were harvested 10 days after selection 
in puromycin. JAK2Per cells were cultured in the presence of respective inhibitors 
during the entire experiment. 

In vitro inhibitor assays, western blot analysis and immunoprecipitations. 
Viable cells were plated in triplicate at 10,000 cells per well in 96-well 
tissue-culture-treated plates in 200 µl medium with increasing concentrations of the JAK2 
inhibitor or PU-H71. Inhibitor assays at 48 h were assessed with the cell viability 
luminescence assay CellTiter-Glo (Promega; catalogue no. G7571). Results were 
normalized to growth of cells in medium containing an equivalent volume of 
DMSO. The effective concentration at which 50% inhibition in proliferation 
occurred was determined with GraphPad Prism 5.0 software. 
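
The normalization and IC50 steps described above can be sketched in Python. This is a simplified stand-in, not the curve fit performed in GraphPad Prism: it expresses treated wells as a percentage of the DMSO control and then estimates the 50% point by log-linear interpolation between the two bracketing concentrations. All function names and example readings are hypothetical.

```python
import math
from statistics import mean, stdev


def percent_of_control(treated_wells, control_wells):
    """Return (mean, s.d.) of treated wells as % of the mean DMSO control."""
    control_mean = mean(control_wells)
    scaled = [100.0 * value / control_mean for value in treated_wells]
    return mean(scaled), stdev(scaled)


def ic50_by_interpolation(concentrations, viabilities):
    """Estimate the concentration giving 50% viability.

    concentrations: ascending, positive values (same units throughout).
    viabilities: % of control at each concentration.
    Returns None if the data never cross 50%.
    """
    points = list(zip(concentrations, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            # Fraction of the way from c_lo to c_hi on a log10 scale.
            fraction = (v_lo - 50.0) / (v_lo - v_hi)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + fraction * (log_hi - log_lo))
    return None
```

Interpolation needs only the two points around the 50% crossing, so it is sensitive to noise in those wells. A fitted sigmoidal dose-response model, as used in the study, draws on the whole curve.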

For western blot analysis, cells were harvested after treatment and processed as 
described previously”*. For immunoprecipitation experiments, cells were harvested 
either at steady-state conditions or after 4h of incubation with a JAK2 inhibitor. 
Protein was normalized with the Bradford dye, and 500-1,000 µg of total protein 
was incubated overnight with the appropriate antibody, followed by incubation 
with Protein G-agarose beads (EMD Chemicals) for a further 2 h. After incubation, 
cells were washed three times with cold PBS and boiled with Laemmli buffer for 
12 min. Supernatant was loaded onto gels and separated as described previously. 
Quantitative RT-PCR analyses. Total RNA was extracted with the RNeasy Mini 
Kit (Qiagen), and cDNA was synthesized with the Verso cDNA Kit (Thermo 
Scientific). Quantitative PCR was performed with FastStart Universal SYBR 
Green Master (Roche) with the following primer sequences: mouse JAK2, 
5'-GATGGCGGTGTTAGACATGA-3' (forward) and 5'-TGCTGAATGAATCTGCGAAA-3' 
(reverse); mouse β-actin, 5'-GATCTGGCACCACACCTTCT-3' 
(forward) and 5'-CCATCACAATGCCTGTGGTA-3' (reverse); human JAK2, 
5'-TCTTTCTTTGAAGCAGCAAG-3' (forward) and 5'-CCATGCCAACTGTTTAGCAA-3' 
(reverse); human HPRT1, 5'-AGATGGTCAAGGTCGCAAG-3' 
(forward) and 5'-GTATTCATTATAGTCAAGGGCATATC-3' (reverse). 
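
Primer sequences such as those listed above are easy to mistranscribe, so a quick sanity check can be useful. The following Python sketch reports length and GC fraction and computes the reverse complement. The helper names are hypothetical, the mouse JAK2 pair is copied from the list above, and these checks are an illustration rather than part of the published protocol.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Return the fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Mouse JAK2 primer pair as given in the text (5' to 3').
MOUSE_JAK2 = {
    "forward": "GATGGCGGTGTTAGACATGA",
    "reverse": "TGCTGAATGAATCTGCGAAA",
}

for name, primer in MOUSE_JAK2.items():
    print(name, len(primer), round(gc_fraction(primer), 2))
```

Both mouse JAK2 primers are 20 nucleotides long. The reverse complement of the reverse primer gives the sequence expected on the same strand as the forward primer.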
Chromatin immunoprecipitation (ChIP) assay. We performed ChIP-qPCR and 
ChIP-Seq analysis in SET2-naive and JAK2-inhibitor-persistent cells with the use 
of a previously described ChIP method. In brief, chromatin from fixed cells was 
fragmented to a size range of 200-700 bases with a Branson 250 Sonifier. 
Solubilized chromatin was immunoprecipitated with antibody against 
H3K4me3 (Abcam 8580), H3K9me3 (Abcam 8898) and H3K27me3 (Upstate 
07-449). Each of these antibodies was validated by western blots and peptide 
competitions as described previously’. Antibody-chromatin complexes were 
pulled down with Protein A-Sepharose, washed and then eluted. After crosslink 
reversal and Proteinase K treatment, immunoprecipitated DNA was extracted 
with phenol/chloroform, precipitated with ethanol, and treated with ribonuclease. 
ChIP DNA was quantified with PicoGreen. For ChIP-qPCR, primer sequences for 
qPCR tiling primers across the JAK2 promoter region are listed in Supplementary 
Table 3. qPCR was performed on an ABI-7500 instrument. For ChIP-Seq in native 
SET2 cells, ChIP DNA and input controls were sequenced with the Illumina 
Genome Analyzer. 

In vitro kinase assays. Protein was harvested from naive and INPer SET-2 cells and 
used for in vitro kinase assays. Endogenous JAK2 protein was precipitated with 
anti-JAK2 antibody (Santa Cruz; catalogue no. sc-34480) and Protein 
G-Sepharose gel. For JAK2 activity assay, the immunoprecipitated JAK2 was 
incubated with myelin basic protein in a buffer containing 25 mM Tris-HCl 
pH 7.5, 10 mM MgCl2, 5 µM ATP and 2 mM dithiothreitol. The reaction was 
incubated at 25 °C with 1 and 10 nM INCB18424 for 1 h and stopped by addition 
of the SDS sample loading buffer. Samples were run under reducing conditions on 
SDS-PAGE gels and immunoblotted using a pan phosphotyrosine antibody 
(Millipore). 

Patient samples. The Institutional Review Boards of Memorial Sloan Kettering 
Cancer Center and M.D. Anderson Cancer Center approved sample collection 
and all experiments. Informed consent was obtained from all human subjects 
before study. Granulocytes were extracted with standard procedures from patient 
samples, and viably frozen before use. 

Gene expression analyses. Ba/F3 WL cells were treated in triplicate for 4 h with 
either DMSO or 0.8 µM INCB18424. INPer WL cells were also treated in triplicate for 
4h with 0.8 uM INCB18424, after which cells were harvested in TRIzol. RNA was 
extracted from the cells and analysed for gene expression with Affymetrix micro- 
array version MOE 430 2.0. Data were analysed with Partek GS v. 6.5 software. 


31. He, H. et al. Identification of potent water-soluble purine-scaffold inhibitors of the 
heat shock protein 90. J. Med. Chem. 49, 381-390 (2006). 

Endogenous antigen tunes the responsiveness of 
naive B cells but not T cells 


Julie Zikherman, Ramya Parameswaran & Arthur Weiss 


In humans, up to 75% of newly generated B cells and about 30% of 
mature B cells show some degree of autoreactivity’. Yet, how B cells 
establish and maintain tolerance in the face of autoantigen expo- 
sure during and after development is not certain. Studies of model 
B-cell antigen receptor (BCR) transgenic systems have highlighted 
the critical role of functional unresponsiveness or ‘anergy’. 
Unlike T cells, evidence suggests that receptor editing and anergy, 
rather than deletion, account for much of B-cell tolerance*”. 
However, it remains unclear whether the mature diverse B-cell 
repertoire of mice contains anergic autoreactive B cells, and if so, 
whether antigen was encountered during or after their develop- 
ment. By taking advantage of a reporter mouse in which BCR 
signalling rapidly and robustly induces green fluorescent protein 
expression under the control of the Nur77 regulatory region, 
antigen-dependent and antigen-independent BCR signalling 
events in vivo during B-cell maturation were visualized. Here we 
show that B cells encounter antigen during development in the 
spleen, and that this antigen exposure, in turn, tunes the respon- 
siveness of BCR signalling in B cells at least partly by downmodu- 
lating expression of surface IgM but not IgD BCRs, and by 
modifying basal calcium levels. By contrast, no analogous process 
occurs in naive mature T cells. Our data demonstrate not only that 
autoreactive B cells persist in the mature repertoire, but that 
functional unresponsiveness or anergy exists in the mature B-cell 
repertoire along a continuum, a fact that has long been suspected, 
but never yet shown. These results have important implications for 
understanding how tolerance in T and B cells is differently 
imposed, and how these processes might go awry in disease. 

A new reporter of antigen receptor signalling was generated recently 
to examine developmental checkpoints during thymic development. 
This took advantage of the dynamic expression pattern of the orphan 
nuclear hormone receptor Nur77 (also known as NR4A1), which is 
induced rapidly in response to negative selection and T-cell receptor 
(TCR) stimulation, to develop a green fluorescent protein (GFP) 
reporter bacterial artificial chromosome (BAC) transgenic line of 
mice. Interestingly, Nur77 is also an immediate early gene that is 
rapidly transcriptionally upregulated in response to BCR signalling. 
To visualize antigen receptor signalling in vivo, we obtained 
independently generated reporter mice from the Gene Expression Nervous 
System Atlas (GENSAT) consortium in which enhanced GFP 
(EGFP) expression is under the control of the Nur77 regulatory 
region. The founders harboured two distinct insertion sites driving 
‘high’ or ‘low’ GFP expression. These were independently backcrossed 
to the C57BL/6 genetic background, yielding GFP^HI and GFP^LO lines. 

Basal expression of GFP in peripheral CD4+ and CD8+ T cells was 
higher in both the GFP^HI and GFP^LO lines compared to the reporter 
line described in ref. 6 (Supplementary Fig. 1a). Although basal GFP 
expression in B cells was substantially higher in the GFP^HI line relative 
to the reporter used in ref. 6, the GFP^LO line failed to express GFP in B 
cells, suggesting an isolated positional effect. For this reason, all 
subsequent B-cell studies have focused on the GFP^HI reporter. After 
stimulation of thymocytes and peripheral T cells with phorbol 
myristate acetate (PMA) and/or ionomycin, GFP expression was 
rapidly induced (Supplementary Fig. 1b; data not shown). In vitro 
stimulation of either the TCR with anti-CD3 or the BCR with 
anti-IgM also induced GFP expression in a dose-dependent manner (Fig. 1a 
and Supplementary Fig. 1c; data not shown). GFP^HI mice were crossed 
to the IgHEL BCR transgenic line (MD4; in which the 
immunoglobulin (Ig) receptor specifically recognizes hen egg lysozyme 
(HEL)) to generate mice with a monoclonal BCR repertoire. The 
resulting MD4-GFP mice showed dose-dependent GFP induction 
after treatment with HEL in vitro (Fig. 1b and Supplementary Fig. 1d). 

To define which antigen-receptor-induced biochemical pathways 
were required to drive GFP expression, we treated anti-CD3- and 
anti-IgM-stimulated lymphocytes with a range of small-molecule 
inhibitors in vitro. These experiments showed a nearly complete 
dependence on Src family kinases in T cells and Syk kinase in B cells 
(Supplementary Fig. 1e, f). In B cells, GFP expression was partially 
dependent on the protein kinase C (PKC), calcineurin, 
mitogen-activated protein kinase (MAPK) and phosphatidylinositol-3-OH 
kinase (PI(3)K) pathways, whereas in T cells, GFP expression most 
clearly required PKC (Supplementary Fig. 1e, f). 

Figure 1 | The Nur77-GFP BAC transgenic reporter is responsive to 
antigen-receptor signalling in vitro. a, Histograms represent GFP and CD69 
expression of GFP^LO transgenic lymph node (LN) T cells treated with varying 
doses of plate-bound anti-CD3ε for 16 h (0.00625-6.4 µg ml⁻¹ in a fourfold 
dilution series). b, Histograms represent GFP and CD69 expression in IgHEL 
GFP^HI transgenic lymph node B cells treated with varying doses of HEL for 16 h 
(0.125-16 ng ml⁻¹ in a twofold dilution series). Data are representative of at 
least three independent experiments. 
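
The dose ranges quoted in the Figure 1 legend (0.00625-6.4 µg ml⁻¹ anti-CD3ε in fourfold steps; 0.125-16 ng ml⁻¹ HEL in twofold steps) imply a fixed number of doses per titration. The following is a minimal sketch of how those doses can be enumerated; the function name `dilution_series` is hypothetical and is not taken from the paper, and the individual intermediate doses are inferred from the stated endpoints and fold change rather than listed in the source.

```python
def dilution_series(lowest, highest, fold):
    """Return doses from `lowest` up to `highest`, each `fold` times the previous one."""
    doses = []
    dose = lowest
    # A small relative tolerance guards against floating-point drift at the top dose.
    while dose <= highest * (1 + 1e-9):
        doses.append(dose)
        dose *= fold
    return doses


# Endpoints and fold changes as stated in the Figure 1 legend.
anti_cd3_doses = dilution_series(0.00625, 6.4, 4)   # µg per ml, fourfold steps
hel_doses = dilution_series(0.125, 16, 2)           # ng per ml, twofold steps

for name, doses in (("anti-CD3", anti_cd3_doses), ("HEL", hel_doses)):
    print(name, len(doses), "doses:", ", ".join(f"{d:g}" for d in doses))
```

Under these assumptions the anti-CD3ε titration spans six doses and the HEL titration spans eight.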

To define whether signals other than antigen-receptor ligation were 
sufficient to drive GFP expression in B cells, we treated GFP^HI B cells in 
vitro with various stimuli. Toll-like receptor 4 (TLR4) and TLR9 
ligands, along with anti-CD40, could drive GFP expression in B cells, 
but this effect was considerably less robust than anti-IgM stimulation 
(Supplementary Fig. 1g). Notably, B-cell activating factor (BAFF) 
treatment with doses as high as 200 ng ml⁻¹, sufficient to induce 
prolonged B-cell survival in vitro, failed to induce GFP-reporter 
expression in B cells (Supplementary Fig. 1g). 

The reporter responded to TCR-dependent signalling in vivo, as 
shown by GFP expression at TCR-dependent checkpoints during 
thymic development. Signalling through the pre-TCR, which comprises 
a recombined TCR-β chain and the invariant pre-TCR-α chain, drives 
developing thymocytes to transit the β-selection checkpoint. We 
observed abrupt upregulation of GFP expression at the 
‘double-negative’ DN3b stage of development, precisely at the β-selection 
checkpoint transition (Supplementary Fig. 2a). 

After successful transit through the β-selection checkpoint, 
double-negative thymocytes upregulate the CD4 and CD8 coreceptors, and 
recombine the TCR-α chain to express a mature αβ TCR. These cells 
then undergo TCR-dependent positive or negative selection. We 
observed marked GFP upregulation in post-selection CD69^hi TCR-β^hi 
‘double-positive’ thymocytes (Supplementary Fig. 2b), as found 
in ref. 6. 

It has been speculated that, at the border of positive and negative 
selection, CD4 single-positive (SP4) thymocytes can be rescued from 
death by adopting the regulatory T-cell (Treg) fate. Indeed, CD25+ SP4 
thymocytes expressed much higher GFP levels than conventional SP4 
thymocytes, indicating that strong TCR signalling favours the Treg fate, in 
agreement with the results from ref. 6 (Supplementary Fig. 2c). 

We previously reported that titration of CD45 expression in an allelic 
series of mice regulates TCR signalling during thymic development. We 
crossed the GFP reporter onto a genetic background harbouring 
two copies of the Lightning (L) CD45 (also known as Ptprc) allele, in 
which a point mutation in the extracellular domain leads to reduced 
surface expression of CD45 (15% of expression levels in wild-type 
mice). Both the fraction of high-GFP-expressing cells and the average 
GFP content of post-selection double-positive thymocytes were 
markedly reduced in so-called L/L GFP mice (Supplementary Fig. 2d). 
This result indicates that the GFP reporter is indeed sensitive to genetic 
titration of TCR signal strength. 

To identify analogous BCR-dependent signalling checkpoints 
during B-cell development, we assessed successive stages of bone 
marrow B-cell development in GFP^HI reporter mice (Fig. 2a and 
Supplementary Fig. 3a, b). We observed virtually no GFP expression 
except in the mature B cells that recirculate to the bone marrow (Hardy 
Fraction F; IgM^lo IgD^hi), indicating that GFP upregulation occurs 
sometime after the early bone marrow stages of development, despite 
evidence of the contribution of antigen encounter to deletion and 
receptor editing in the bone marrow. 

Splenic B-cell development, which follows maturation in the bone 
marrow, is subdivided into successive transitional stages. We 
observed a bimodal distribution of GFP expression among splenic B 
cells and found that early transitional B cells (T1) were largely GFP 
negative, but that later transitional stages (T2 and T3) contained a 
large proportion of GFP-positive B cells (Fig. 2b, c). Mature follicular 
B cells were mostly GFP positive and showed a broad distribution of 
GFP expression (Fig. 2c). Notably, a similar pattern of GFP expression, 
albeit at much lower levels, was evident in an independently generated 
GFP reporter (Supplementary Fig. 3c). GFP expression across these 
splenic developmental stages inversely correlated with surface IgM 
expression (Fig. 2d). Transitional B-cell stages have previously 
been subdivided into T2 and T3 stages on the basis of surface IgM 
downregulation (Fig. 2e). We observed that GFP upregulation seems 
to occur at precisely this transition between the T2 and T3 stages 
(Fig. 2e and Supplementary Fig. 3d, e). 

Figure 2 | Expression of the Nur77-GFP BAC transgenic reporter is 
upregulated at specific checkpoints during B-cell development. a, Left, plot 
of GFP^HI transgenic bone marrow (BM) CD19+ B cells stained for IgM and IgD 
to identify pre-B, immature (imm.), transitional (trans.) and mature 
recirculating (MR) subsets (counter-clockwise from bottom left corner). 
Middle, bone marrow subsets are colour coded. Right, overlaid histograms 
representing GFP expression in these subsets. b, Left, plot of GFP^HI transgenic 
splenic CD19+ B cells stained for CD23 and CD21 expression to identify T1 
(CD23− CD21−), T2/follicular (FO; CD23+ CD21+) and marginal zone (MZ; 
CD21^hi) subsets. Middle, splenic B-cell subsets are colour coded. Right, 
overlaid histograms represent GFP expression in these subsets. c, Left, plot of 
GFP^HI transgenic splenic CD19+ B cells, excluding the marginal zone 
compartment, stained for CD23 and AA4.1 expression to identify T1 (AA4.1+ 
CD23−), T2/3 (AA4.1+ CD23+) and follicular (AA4.1− CD23+) subsets. 
Middle, splenic B-cell subsets are colour coded. Right, overlaid histograms 
represent GFP expression in these subsets. d, Overlaid histograms represent 
GFP (left) and IgM (right) expression in T1, T2/3 and follicular subsets as 
identified in c. e, Left, T2/3 (AA4.1+ CD23+) B-cell subset subdivided by IgM 
expression into T2 (IgM^hi) and T3 (IgM^lo) stages. Right, overlaid histograms 
represent GFP expression in T2 and T3 subsets. All data are representative of at 
least three independent experiments. 

Interestingly, in vitro BCR stimulation of bone marrow and splenic 
B-cell subsets resulted in GFP upregulation to differing extents. Minimal 
GFP upregulation was seen in bone marrow immature and transitional 
stages, but robust upregulation was evident in splenic T1, T2 and 
follicular B cells (Supplementary Fig. 4). This indicates that splenic, 
but not bone marrow, subsets have the capacity to upregulate GFP. 

To determine whether the amount of GFP expression in unstimulated 
B cells reflected BCR signal strength/antigen exposure, we took 
advantage of our previously characterized allelic series of 
CD45-expressing mice. In these animals, CD45 expression is genetically 
varied across a broad range and correlates with BCR signal strength. 
L/L mice with reduced surface expression of CD45 show impaired 
BCR signal transduction. So-called H/− mice express a normally 
splicing CD45 transgene superimposed on endogenous wild-type 
CD45 to produce an animal with supraphysiologic CD45 expression. 
B cells from these mice show enhanced BCR signal strength. After 
crossing the Nur77-GFP reporter mouse to the CD45 allelic series, 
we noted that GFP expression at the T1 stage was unaffected, whereas 
increasing CD45 expression resulted in a higher proportion of 
GFP-positive B cells at the T2 stage (Fig. 3a and Supplementary Fig. 5a). 
Notably, the distribution of GFP expression in this compartment 
remained bimodal, further supporting the notion that a discrete 
signalling event occurs at this stage, the threshold of which is regulated 
by CD45 and BCR signal strength. GFP expression in follicular mature 
B cells was markedly reduced in L/L mice, consistent with a reduction 
in BCR signal strength, but was minimally altered in H/− mice with 
higher CD45 expression (Fig. 3a and Supplementary Fig. 5b). 
However, modulation of GFP expression by CD45 was much more 
apparent in the marginal zone compartment, suggesting an exquisite 
sensitivity to BCR signal strength (Fig. 3a and Supplementary Fig. 5b). 

Figure 3 | Expression of the Nur77-GFP BAC transgenic reporter is 
sensitive to genetic modulation of BCR signal strength and to antigen. 
a, CD45 allelic series (low to high CD45 expression: L/L, L/+, +/+ and H/−) 
GFP^HI transgenic splenic B cells were stained to identify B-cell subsets as in 
Fig. 2b, c. Overlaid histograms represent GFP expression in splenic subsets as 
gated in Supplementary Fig. 5a. b, CD45+/+ GFP^HI transgenic and H/− GFP^HI 
transgenic splenic B cells with an unrestricted (no IgHEL transgene; IgHEL−) 
or restricted (IgHEL+) repertoire in the absence of sHEL antigen were analysed 
as in a. Overlaid histograms represent GFP expression in splenic subsets as 
gated in Supplementary Fig. 5c. c, CD45+/+ GFP^HI transgenic splenic B cells 
with an unrestricted or restricted repertoire in the presence or absence of sHEL 
antigen were analysed as in a. Overlaid histograms represent GFP expression in 
splenic subsets as gated in Supplementary Fig. 5d. All animals in these 
experiments were generated through genetic crosses. All data are representative 
of at least five independent experiments. 

As Nur77-GFP expression is regulated by modulation of BCR 
signal strength (Fig. 3a and Supplementary Fig. 5b), we proposed that 
endogenous antigen exposure might drive BCR signalling during 
maturation of wild-type B cells with a diverse repertoire. To explore 
this possibility, we took advantage of the IgHEL/soluble (s)HEL 
double-transgenic system (MD4/ML5), in which MD4 mice with a 
monoclonal IgHEL BCR can be studied in the presence or absence 
of sHEL. In the Nur77-GFP reporter mice with the IgHEL BCR 
transgene-restricted repertoire in the absence of antigen, we observed 
a marked reduction in GFP in splenic B cells (Fig. 3b and 
Supplementary Fig. 5b, c). Notably, the bimodal distribution of GFP 
expression observed in the context of a wild-type repertoire was lost in 
these mice. Further increasing CD45 expression in the context of such 
a restricted repertoire to increase tonic BCR signalling resulted in 
increasing GFP expression, but again only in a unimodal rather than 
a bimodal distribution (Fig. 3b and Supplementary Fig. 5c). Finally, the 
introduction of sHEL ligand by crossing ML5 (sHEL transgenic) mice 
to IgHEL transgenic reporter mice resulted in increased GFP expres- 
sion as expected, and remarkably reconstituted bimodal GFP expres- 
sion in the transitional splenic stages of development (Fig. 3c and 
Supplementary Figs 5d and 6). These data indicate that normal 
B-cell development is characterized by a wide range of antigen experi- 
ence, and that Nur77-driven GFP distribution in follicular mature B 
cells serves as a marker of such exposure. 

To determine whether antigen recognition during splenic B-cell 
development had functional effects on signalling, we selectively gated 
on the extremes of GFP expression. We observed that high-GFP- 
expressing B cells had dampened 40S ribosomal protein S6 (RPS6) 
phosphorylation (a PI(3)K-dependent event) and calcium entry 
relative to low-GFP-expressing B cells in response to IgM ligation 
(Supplementary Figs 4a and 7a). Moreover, we observed that basal cal- 
cium levels were elevated in high-GFP-expressing B cells, reminiscent of 
anergic B cells identified in various model BCR transgenic systems. 
Dampened inducible signalling and increased basal calcium were not 
isolated properties of very-high-GFP-expressing B cells, but rather 
seemed to represent continuous functional properties across the entire 
spectrum of GFP expression of mature follicular B cells (Fig. 4a). 
Furthermore, restricting the BCR repertoire in the absence of ligand 
ablated differences in functional responsiveness, but not in basal calcium 
(Supplementary Fig. 7b). Notably, neither inducible calcium responses 
nor basal calcium levels correlated with GFP expression in naive CD25* 
CD4* T cells, indicating that only in B cells does antigen exposure tune 
functional responsiveness (Supplementary Fig. 8a). 

Mature B cells express two isotypes of the BCR, IgM and IgD. We 
wanted to determine whether the functional responsiveness in GFP B 
cells was modulated in response to stimulation through the IgD BCR in 
the same manner as it is to the IgM BCR. We found that this was not 
the case (Fig. 4b, c); responsiveness to IgM BCR stimulation was 
markedly blunted in cells with high GFP expression, whereas IgD 
responsiveness remained intact. Stimulation with anti-κ antibodies 
to ligate both surface IgM and IgD resembled isolated IgD stimulation 
(Supplementary Fig. 8b). By simultaneously staining for surface IgM 
expression with a nonstimulatory monovalent Fab fragment and asses- 
sing calcium responses in GFP B cells, we show that differences in 
surface IgM expression largely accounted for the functional differences 
at different levels of GFP expression (Supplementary Fig. 8c). 
However, basal calcium differences were independent of surface IgM 
expression (Supplementary Fig. 8d). 
Figure 4 | GFP expression predicts functional responsiveness and 
autoreactivity of B cells. a, Left, GFP^HI transgenic follicular (CD23+ AA4.1−) 
splenic B cells were subdivided into colour-coded bins on the basis of GFP 
expression. Right, GFP^HI transgenic splenic B cells were loaded with Indo-1 dye 
and stimulated with 10 µg ml⁻¹ anti-IgM. Ratiometric assessment of 
intracellular calcium was carried out by flow cytometry. Upper right panel 
represents calcium entry in total follicular splenic B cells. Lower right panel 
represents basal and inducible intracellular calcium in GFP-specific bins. IC, 
intracellular. b, c, Intracellular calcium entry was assessed in GFP^HI transgenic 
follicular splenic B cells after anti-IgM (b) or anti-IgD (c) stimulation at varying 
doses. High and low GFP-expressing gates are overlaid. Data in a-c are 
representative of at least three independent experiments. d, Overlaid 
histograms represent pre- and post-sort follicular mature CD23+ AA4.1− B 
cells as gated in Supplementary Fig. 9. The 15% lowest and highest GFP 
fractions from the follicular mature B-cell compartment were selected for 
sorting. e, High-expressing-GFP and low-expressing-GFP B cells sorted as 
described in d and Supplementary Fig. 9a were stimulated in vitro with LPS for 4 
days. Supernatants were subjected to ANA IgM enzyme-linked 
immunosorbent assays (ELISA) and total IgM ELISA. The graph represents 
quantification of ANA IgM normalized to total IgM from four independent 
sorting experiments. Data are mean ± s.e.m. AEU, arbitrary ELISA units. 
Significance was assessed by unpaired t-test. **P < 0.005. 


Additional characteristics of monoclonal BCR transgenic models of 
anergic B cells include a failure to upregulate activation markers in 
response to various stimuli. We stimulated sorted high- and 
low-GFP-expressing B cells and observed that activation marker 
upregulation in response to IgM stimulation is impaired in 
high-GFP-expressing B cells (Supplementary Figs 9, 10). Importantly, responses 
to lipopolysaccharide (LPS) and CD40 were unaffected, as was in vitro 
survival in the presence or absence of BAFF (data not shown). 

Finally, to determine directly whether the BCR repertoire of mature 
B cells with high GFP expression and impaired functional responses 
was indeed autoreactive, sorted high- and low-GFP-expressing B cells 
were polyclonally stimulated in vitro with LPS, and secreted antibody 
was assessed for anti-nuclear antibody (ANA) reactivity (Fig. 4d, e and 
Supplementary Fig. 9). Notably, neither cell proliferation nor antibody 
secretion following LPS stimulation was impaired in high-GFP B cells 
(data not shown). We found a significant increase in ANA reactivity, 
suggesting auto- or polyreactivity in the repertoire of such naturally 
occurring anergic B cells (Fig. 4d, e). 

The human B-cell repertoire is characterized by a high prevalence 
of polyreactive and autoreactive BCRs. Anergy or functional 
unresponsiveness may serve to keep such autoreactive clones in check. 
Array data have shown that wild-type B cells have an intermediate 
phenotype between antigen-naive and anergic B cells, suggesting the 
possible presence of anergic B cells in the wild-type mature 
repertoire. It has recently been argued that the so-called T3 splenic 
subset may in fact represent sequestered anergic B cells rather than 
an intermediate developmental stage. However, the prevalence of 
anergy in the normal mature B-cell repertoire has not been clear. We 
show that there is a continuum of anergy or unresponsiveness to anti- 
IgM stimulation in the mature B-cell compartment, and that this 
responsiveness is, in turn, tuned by developmental antigen recognition. 

It has long been observed that marked IgM downregulation is seen 
in BCR transgenic systems in the presence of either antigen or 
enhanced BCR signal strength. IgD, by contrast, remains 
relatively unmodulated in these systems. Here, we show that, in the 
wild-type B-cell repertoire, IgM downregulation correlates with the 
extent of antigen recognition during development and accounts for 
dampened B-cell responses to anti-IgM stimulation, whereas IgD 
expression and responses are intact. We suggest that this constitutes 
a general mechanism to modulate BCR signalling in autoreactive B 
cells, but permits them to persist as a pool of extended antibody spe- 
cificity for purposes of protective immunity. Indeed, we demonstrate 
an increased proportion of ANA-reactive BCR specificities in high- 
GFP-expressing B cells, suggesting that these cells are auto- or poly- 
reactive. It is tempting to speculate that this large reservoir of dormant 
autoreactive B cells in the mature BCR repertoire may serve as the 
source of pathogenic autoantibodies that characterize rheumatic 
diseases such as systemic lupus erythematosus. 


METHODS SUMMARY 


The following mouse strains have been previously described: the CD45 allelic 
series including Lightning (L/L), H/− (HE) mice, IgHEL (MD4) and sHEL 
(ML5) mice. Nur77-EGFP BAC transgenic mice were obtained from the 
GENSAT consortium. Nur77-GFP reporter mice described in ref. 6 were 
supplied by the Hogquist laboratory. All strains were backcrossed to the C57BL/6 
genetic background for at least six generations and were maintained in the 
University of California, San Francisco animal facility in accordance with 
institutional regulations. In vitro lymphocyte-stimulation assays were performed 
as previously described on plates either containing soluble anti-IgM F(ab′)2 or 
precoated with anti-CD3ε, and/or containing various stimuli and inhibitors. 
Calcium assays were performed as previously described, except that Indo-1 dye 
(Invitrogen) was used to load cells, and an ultraviolet laser on the BD Fortessa 
was used for detection. Intracellular phospho-S6 staining and stimulation were 
performed as previously described. Sorting of GFP-high- and -low-expressing B 
cells using a MoFlo cell sorter was performed as follows: splenic and lymph node 
cells were pooled and stained to identify DAPI 
(4′,6-diamidino-2-phenylindole)-negative CD23+ AA4.1− mature B cells. The 
highest and lowest 15% of GFP-expressing B cells were retrieved and were 
incubated with varying stimuli. Sorted cells were plated at a 
concentration of 1.5 X 10° cells per ml in complete DMEM and were 
stimulated with anti-IgM F(ab′)2 at varying doses for 16 h to assess activation 
marker upregulation. Alternatively, sorted cells were incubated with 10 µg ml⁻¹ 
LPS at a concentration of 6 X 10° cells per ml in complete DMEM to drive 
polyclonal antibody secretion. Supernatants were then collected and subjected to 
enzyme-linked immunosorbent assay (ELISA). The assay to detect total IgM was 
performed as previously described. The ANA ELISA kit obtained from Inova Inc. 
was used as per manufacturer’s instructions. 


Full Methods and any associated references are available in the online version of 
the paper. 

METHODS 

Mice. The CD45 allelic series including Lightning (L/L) and H/H, H/− (HE) mice 
has been previously described, as have IgHEL (MD4) and sHEL (ML5) 
mice. Nur77-EGFP BAC transgenic mice were obtained from the GENSAT 
consortium. Nur77-GFP reporter mice described in ref. 6 were supplied by the 
Hogquist laboratory. All strains were backcrossed to the C57BL/6 genetic back- 
ground at least six generations. Mice were used for all functional and biochemical 
experiments at age 5-9 weeks. All mice were housed in a specific pathogen-free 
facility at University of California, San Francisco in accordance with the 
University’s Animal Care Committee and National Institutes of Health guidelines. 

Antibodies and other reagents. The following antibodies were used: antibodies to 
murine CD1d, CD4, CD5, CD8, CD11b, CD11c, CD19, CD21, CD23, CD24, 
CD25, CD43, CD44, CD69, CD93 (AA4.1), BP-1, IgD, IgM, pNK, γδTCR and 
TCR-β conjugated to fluorescein isothiocyanate (FITC), phycoerythrin (PE), 
peridinin chlorophyll protein complex (PerCP)-Cy5.5, PE-Cy5.5, PE-Cy7, Pacific 
blue, allophycocyanin (APC) or Alexa647 for fluorescence-activated cell sorting 
(FACS) staining (eBiosciences or BD Biosciences), phospho-S6 Alexa488 (2F9) 
antibody for intracellular staining, unconjugated CD3ε (2C11) antibody (Harlan), 
goat anti-Armenian hamster immunoglobulin (H+L), goat anti-mouse IgM 
F(ab′)2 for stimulation and Fab fragment coupled to Alexa647 for surface staining 
(Jackson Immunoresearch), biotinylated anti-IgD (BD Biosciences), streptavidin 
(Sigma), mouse IgM-UNLB, mouse IgH+L-UNLB, goat anti-mouse IgM biotin 
and streptavidin-horseradish peroxidase (HRP) for ELISA, and goat anti-mouse κ 
for stimulation (Southern Biotech). Inhibitors and stimuli include ionomycin 
1 µM, PKC inhibitors (Go-6983 40 nM and Ro-32-0432 40 nM), Bay 61-3606 
10 µM, cyclosporine A 20 µM, PP2 20 µM, Ly-294002 20 nM (Calbiochem), 
phorbol myristate acetate (PMA) 0.02 µg ml⁻¹, cycloheximide (CHX) 10 µg ml⁻¹, LPS 
50 µg ml⁻¹, HEL (Sigma), U0126 10 µM (Cell Signaling), CpG 2 µM (InvivoGen), 
anti-CD40 1 µg ml⁻¹ (BD) and BAFF 200 ng ml⁻¹ (R&D). 

Flow cytometry and data analysis. Cells were stained with antibodies of the 
indicated specificities and analysed on a FACSCalibur or Fortessa (BD 
Biosciences) as described previously. Data analysis was performed using 
FlowJo v8.8.4 (Treestar). Statistical analysis and graphs were generated using 
Prism v4c (GraphPad Software). 

In vitro lymphocyte stimulation (+/− inhibitor). Single cell suspensions of 
lymphocytes were plated at a concentration of 1.5 X 10° cells per ml in complete 
DMEM and were incubated in the presence of various stimuli and/or inhibitors at 
the doses described above for 16 h. Assays were performed as previously described. 

Calcium measurements. Assays were performed as previously described, except 
that Indo-1 dye (Invitrogen) was used to load cells, and an ultraviolet laser on the 
BD Fortessa was used for detection. Before stimulation and analysis, splenocytes 
were surface stained for expression of CD23 and AA4.1 to identify B-cell subsets. 
Where noted, cells were also pre-stained with anti-IgM Fab fragments to identify 
surface IgM expression without inducing BCR stimulation. Stimulation was 
carried out using either varying doses of anti-IgM F(ab′)2, anti-κ antibody or 
biotinylated anti-IgD followed by streptavidin crosslinking (15 µg ml⁻¹), or 
varying doses of anti-CD3ε followed by goat anti-Armenian hamster immunoglobulin 
crosslinking (50 µg ml⁻¹). 

Intracellular phospho-S6 staining. Staining and stimulation were performed as 
previously described. 

B-cell sorting and stimulation. GFP-high- and -low-expressing B cells were sorted 
using a MoFlo cell sorter. Splenic and lymph node cells were pooled and stained for 
CD23 and AA4.1 as well as DAPI (4′,6-diamidino-2-phenylindole) to identify 
CD23+ AA4.1− mature B cells. The 15% highest and lowest GFP-expressing B cells 
were retrieved and incubated with varying stimuli. Sorted cells were plated at a 
concentration of 1.5 X 10° cells per ml in complete DMEM and were stimulated 
with anti-IgM F(ab′)2 at varying doses for 16 h. Cells were then stained for CD69 
expression in order to assess activation-marker upregulation. Alternatively, sorted 
cells were incubated with 10 µg ml⁻¹ LPS at a concentration of 6 X 10° cells per ml in 
complete DMEM in order to drive polyclonal antibody secretion. 
Supernatants were then collected and subjected to ELISA. 
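
Selecting the 15% highest and lowest GFP-expressing cells amounts to placing two intensity cutoffs at the tails of the GFP distribution. The following is a minimal sketch of that gating logic only; the function names `gfp_sort_gates` and `split_by_gates` are hypothetical, the intensities are simulated rather than real cytometry events, and this does not represent the sorter software used in the study.

```python
import random


def gfp_sort_gates(gfp_values, fraction=0.15):
    """Return (low_cutoff, high_cutoff) bounding the lowest and highest
    `fraction` of events ranked by GFP intensity."""
    ordered = sorted(gfp_values)
    n = len(ordered)
    k = round(n * fraction)           # number of events in each tail
    if k == 0:
        raise ValueError("too few events for the requested fraction")
    low_cutoff = ordered[k - 1]       # brightest event still inside the low gate
    high_cutoff = ordered[n - k]      # dimmest event inside the high gate
    return low_cutoff, high_cutoff


def split_by_gates(gfp_values, low_cutoff, high_cutoff):
    """Partition events into the GFP-low and GFP-high gates."""
    low = [v for v in gfp_values if v <= low_cutoff]
    high = [v for v in gfp_values if v >= high_cutoff]
    return low, high


# Simulated log-normal intensities stand in for gated follicular B-cell events.
random.seed(0)
events = [random.lognormvariate(5.0, 1.0) for _ in range(10_000)]
low_cut, high_cut = gfp_sort_gates(events)
gfp_low, gfp_high = split_by_gates(events, low_cut, high_cut)
print(len(gfp_low), len(gfp_high))
```

Ranking events and cutting at fixed fractions keeps the two gates the same size regardless of how broad the GFP distribution is in a given sample.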

ELISA. The ELISA to detect total IgM was performed as previously described. 
The ANA ELISA kit obtained from Inova, Inc. was used as per manufacturer’s 
instructions. Biotinylated anti-mouse IgM (1:5,000) and streptavidin-HRP 
conjugate (1:4,000) were used for detection in both assays for signal amplification 
(Southern Biotech), and slow kinetic tetramethylbenzidine (Sigma) was used as 
substrate. Molecular Devices SpectraMax and SoftMax Pro software were used to 
read plates. ANA IgM quantification was normalized to total IgM for each sample. 
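
The normalization and the summary statistics reported for the ANA assay (mean ± s.e.m., unpaired t-test) can be sketched as follows. This is an illustration under stated assumptions: the ELISA readings are invented placeholder numbers, the helper names are hypothetical, and the t statistic uses the equal-variance (Student) form because the source does not say which variant of the unpaired test was applied.

```python
from math import sqrt
from statistics import mean, stdev


def normalize(ana_igm, total_igm):
    """Divide each ANA IgM reading by the total IgM reading of the same sample."""
    return [a / t for a, t in zip(ana_igm, total_igm)]


def sem(values):
    """Standard error of the mean."""
    return stdev(values) / sqrt(len(values))


def unpaired_t(x, y):
    """Student's t statistic (equal variances assumed) and degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    t_stat = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t_stat, nx + ny - 2


# Placeholder readings for four hypothetical sorts (arbitrary ELISA units).
ana_high, total_high = [8.0, 9.0, 10.0, 9.0], [2.0, 2.0, 2.0, 2.0]
ana_low, total_low = [4.0, 5.0, 6.0, 5.0], [2.0, 2.0, 2.0, 2.0]

ratio_high = normalize(ana_high, total_high)
ratio_low = normalize(ana_low, total_low)
t_stat, dof = unpaired_t(ratio_high, ratio_low)

print(f"GFP-high: {mean(ratio_high):.2f} +/- {sem(ratio_high):.2f}")
print(f"GFP-low:  {mean(ratio_low):.2f} +/- {sem(ratio_low):.2f}")
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

Converting the t statistic to a P value requires the t distribution, which is left to a statistics package.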